Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
gzip(1) encodes the OS it runs on in the 10th byte of its output. It
uses the following OS_CODE values according to its tailor.h [1]:
0 - MS-DOS
3 - UNIX
5 - Atari ST
6 - OS/2
10 - TOPS-20
11 - Windows NT
The gzip.exe that comes with Git for Windows uses OS_CODE 3 for some
reason, so this value is used on practically all supported platforms
when generating tgz archives using gzip(1).
Zlib uses a bigger set of values according to its zutil.h [2], aligned
with section 4.4.2 of the ZIP specification, APPNOTE.txt [3]:
0 - MS-DOS
1 - Amiga
3 - UNIX
4 - VM/CMS
5 - Atari ST
6 - OS/2
7 - Macintosh
8 - Z-System
10 - Windows NT
11 - MVS (OS/390 - Z/OS)
13 - Acorn Risc
16 - BeOS
18 - OS/400
19 - OS X (Darwin)
Thus the internal gzip implementation in archive-tar.c sets different
OS_CODE header values on major platforms Windows and macOS. Git for
Windows uses its own zlib-based variant since v2.20.1 by default and
thus embeds OS_CODE 10 in tgz archives.
The tar archive for a commit is generated consistently on all systems
(by the same Git version). The OS_CODE in the gzip header does not
influence extraction. Avoid leaking OS information and make tgz
archives constistent and reproducable (with the same Git and libz
versions) by using OS_CODE 3 everywhere.
At least on macOS 12.4 this produces the same output as gzip(1) for the
examples I tried:
# before
$ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
3abbffb40b7c63cf9b7d91afc682f11682f80759 -
# with this patch
$ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
dc6dc6ba9636d522799085d0d77ab6a110bcc141 -
$ git archive --format=tar v2.36.0 | gzip -cn | shasum
dc6dc6ba9636d522799085d0d77ab6a110bcc141 -
[1] https://git.savannah.gnu.org/cgit/gzip.git/tree/tailor.h
[2] https://github.com/madler/zlib/blob/master/zutil.h
[3] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git uses zlib for its own object store, but calls gzip when creating tgz
archives. Add an option to perform the gzip compression for the latter
using zlib, without depending on the external gzip binary.
Plug it in by making write_block a function pointer and switching to a
compressing variant if the filter command has the magic value "git
archive gzip". Does that indirection slow down tar creation? Not
really, at least not in this test:
$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'
Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
Time (mean ± σ): 4.044 s ± 0.007 s [User: 3.901 s, System: 0.137 s]
Range (min … max): 4.038 s … 4.059 s 10 runs
Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
Time (mean ± σ): 4.047 s ± 0.009 s [User: 3.903 s, System: 0.138 s]
Range (min … max): 4.038 s … 4.066 s 10 runs
How does tgz creation perform?
$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
Time (mean ± σ): 20.404 s ± 0.006 s [User: 23.943 s, System: 0.401 s]
Range (min … max): 20.395 s … 20.414 s 10 runs
Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
Time (mean ± σ): 23.807 s ± 0.023 s [User: 23.655 s, System: 0.145 s]
Range (min … max): 23.782 s … 23.857 s 10 runs
Summary
'./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
So the internal implementation takes 17% longer on the Linux repo, but
uses 2% less CPU time. That's because the external gzip can run in
parallel on its own processor, while the internal one works sequentially
and avoids the inter-process communication overhead.
What are the benefits? Only an internal sequential implementation can
offer this eco mode, and it allows avoiding the gzip(1) requirement.
This implementation uses the helper functions from our zlib.c instead of
the convenient gz* functions from zlib, because the latter doesn't give
the control over the generated gzip header that the next patch requires.
Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All tar archive writes have the same size and are done to the same file
descriptor. Move them to a common function, write_block(), to reduce
code duplication and make it easy to change the destination.
Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The void pointer "data" in struct archiver is only used to store filter
commands to pass tar archives to, like gzip. Rename it accordingly and
also turn it into a char pointer to document the fact that it's a string
reference.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change a pattern of hardcoding an "argv" array size, populating it and
assigning to the "argv" member of "struct child_process" to instead
use "strvec_push()" to add data to the "args" member.
As noted in the preceding commit this moves us further towards being
able to remove the "argv" member in a subsequent commit
These callers could have used strvec_pushl(), but moving to
strvec_push() makes the diff easier to read, and keeps the arguments
aligned as before.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead. It shortens the code and infers the
element size automatically.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Compression programs like zip, gzip, bzip2 and xz allow to adjust the
trade-off between CPU cost and size gain with numerical options from -1
for fast compression and -9 for high compression ratio. zip also
accepts -0 for storing files verbatim. git archive directly support
these single-digit compression levels for ZIP output and passes them to
filters like gzip.
Zstandard additionally supports compression level options -10 to -19, or
up to -22 with --ultra. This *seems* to work with git archive in most
cases, e.g. it will produce an archive with -19 without complaining, but
since it only supports single-digit compression level options this is
the same as -1 -9 and thus -9.
Allow git archive to accept multi-digit compression levels to support
the full range supported by zstd. Explicitly reject them for the ZIP
format, as otherwise deflateInit2() would just fail with a somewhat
cryptic "stream consistency error".
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Centralize reading of symlink destinations and the contents of regular
files that are too small to be streamed. This reduces code duplication
and allows future patches to add support for adding non-tracked files to
archives. The backends are expected to stream blobs if buffer is NULL.
object_file_to_archive() is only called from archive.c and thus no
longer exported.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We return the length to a subset of a string using an "int *"
out-parameter. This is fine most of the time, as we'd expect config keys
to be relatively short, but it could behave oddly if we had a gigantic
config key. A more appropriate type is size_t.
Let's switch over, which lets our callers use size_t as appropriate
(they are bound by our type because they must pass the out-parameter as
a pointer). This is mostly just a cleanup to make it clear this code
handles long strings correctly. In practice, our config parser already
chokes on long key names (because of a similar int/size_t mixup!).
When doing an int/size_t conversion, we have to be careful that nobody
was trying to assign a negative value to the variable. I manually
confirmed that for each case here. They tend to just feed the result to
xmemdupz() or similar; in a few cases I adjusted the parameter types for
helper functions to make sure the size_t is preserved.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some callers of open_istream() at archive-tar.c and archive-zip.c are
capable of working on arbitrary repositories but the repo struct is not
passed down to open_istream(), which uses the_repository internally. For
now, that's not a problem since the said callers are only being called
with the_repository. But to be consistent and avoid future problems,
let's allow open_istream() to receive a struct repository and use that
instead of the_repository. This parameter addition will also be used in
a future patch to make sha1-file.c:check_object_signature() be able to
work on arbitrary repos.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git archive" recorded incorrect length in extended pax header in
some corner cases, which has been corrected.
* rs/pax-extended-header-length-fix:
archive-tar: turn length miscalculation warning into BUG
archive-tar: use size_t in strbuf_append_ext_header()
archive-tar: fix pax extended header length calculation
archive-tar: report wrong pax extended header length
Now that we're confident our pax extended header calculation is correct,
turn the criticality of the assertion up to the maximum, from warning
right up to BUG. Simplify the test, as the stderr comparison step would
not be reached in case the BUG message is triggered.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
One of its callers already passes in a size_t value. Use it
consistently in this function.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A pax extended header record starts with a decimal number. Its value
is the length of the whole record, including its own length.
The calculation of that number in strbuf_append_ext_header() is off by
one in case the length of the rest is close to a higher order of
magnitude. This affects paths and link targets a bit shorter than 1000,
10000, 100000 etc. characters -- paths with a length of up to 100 fit
into the tar header and don't need a pax extended header.
The mistake has been present since the function was added by ae64bbc18c
("tar-tree: Introduce write_entry()", 2006-03-25).
Account for digits added to len during the loop and keep incrementing
until we have enough space for len and the rest. The crucial change is
to check against the current value of len before each iteration, instead
of against its value before the loop.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Extended header entries contain a length value that is a bit tricky to
calculate because it includes its own length (number of decimal digits)
as well. We get it wrong in corner cases. Add a check, report wrong
results as a warning and add a test for exercising it.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change the commit_sha1 member to be called "commit_oid" and change it to
be a pointer to struct object_id. Additionally, update some uses of
GIT_SHA1_HEXSZ and hard-coded values to use the_hash_algo instead.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We indent with TABs and sometimes for fine alignment, TABs followed by
spaces, but never all spaces (unless the indentation is less than 8
columns). Indenting with spaces slips through in some places. Fix
them.
Imported code and compat/ are left alone on purpose. The former should
remain as close as upstream as possible. The latter pretty much has
separate maintainers, it's up to them to decide.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When printing variables which contain a size, today "unsigned long"
is used at many places.
In order to be able to change the type from "unsigned long" into size_t
some day in the future, we need to have a way to print 64 bit variables
on a system that has "unsigned long" defined to be 32 bit, like Win64.
Upcast all those variables into uintmax_t before they are printed.
This is to prepare for a bigger change, when "unsigned long"
will be converted into size_t for variables which may be > 4Gib.
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The more library-ish parts of the codebase learned to work on the
in-core index-state instance that is passed in by their callers,
instead of always working on the singleton "the_index" instance.
* nd/no-the-index: (24 commits)
blame.c: remove implicit dependency on the_index
apply.c: remove implicit dependency on the_index
apply.c: make init_apply_state() take a struct repository
apply.c: pass struct apply_state to more functions
resolve-undo.c: use the right index instead of the_index
archive-*.c: use the right repository
archive.c: avoid access to the_index
grep: use the right index instead of the_index
attr: remove index from git_attr_set_direction()
entry.c: use the right index instead of the_index
submodule.c: use the right index instead of the_index
pathspec.c: use the right index instead of the_index
unpack-trees: avoid the_index in verify_absent()
unpack-trees: convert clear_ce_flags* to avoid the_index
unpack-trees: don't shadow global var the_index
unpack-trees: add a note about path invalidation
unpack-trees: remove 'extern' on function declaration
ls-files: correct index argument to get_convert_attr_ascii()
preload-index.c: use the right index instead of the_index
dir.c: remove an implicit dependency on the_index in pathspec code
...
Many more strings are prepared for l10n.
* nd/i18n: (23 commits)
transport-helper.c: mark more strings for translation
transport.c: mark more strings for translation
sha1-file.c: mark more strings for translation
sequencer.c: mark more strings for translation
replace-object.c: mark more strings for translation
refspec.c: mark more strings for translation
refs.c: mark more strings for translation
pkt-line.c: mark more strings for translation
object.c: mark more strings for translation
exec-cmd.c: mark more strings for translation
environment.c: mark more strings for translation
dir.c: mark more strings for translation
convert.c: mark more strings for translation
connect.c: mark more strings for translation
config.c: mark more strings for translation
commit-graph.c: mark more strings for translation
builtin/replace.c: mark more strings for translation
builtin/pack-objects.c: mark more strings for translation
builtin/grep.c: mark strings for translation
builtin/config.c: mark more strings for translation
...
With 'struct archive_args' gaining new repository pointer, we don't
have to assume the_repository in the archive backends anymore.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The conversion to pass "the_repository" and then "a_repository"
throughout the object access API continues.
* sb/object-store-grafts:
commit: allow lookup_commit_graft to handle arbitrary repositories
commit: allow prepare_commit_graft to handle arbitrary repositories
shallow: migrate shallow information into the object parser
path.c: migrate global git_path_* to take a repository argument
cache: convert get_graft_file to handle arbitrary repositories
commit: convert read_graft_file to handle arbitrary repositories
commit: convert register_commit_graft to handle arbitrary repositories
commit: convert commit_graft_pos() to handle arbitrary repositories
shallow: add repository argument to is_repository_shallow
shallow: add repository argument to check_shallow_file_for_update
shallow: add repository argument to register_shallow
shallow: add repository argument to set_alternate_shallow_file
commit: add repository argument to lookup_commit_graft
commit: add repository argument to prepare_commit_graft
commit: add repository argument to read_graft_file
commit: add repository argument to register_commit_graft
commit: add repository argument to commit_graft_pos
object: move grafts to object parser
object-store: move object access functions to object-store.h
Developer support update, by using BUG() macro instead of die() to
mark codepaths that should not happen more clearly.
* js/use-bug-macro:
BUG_exit_code: fix sparse "symbol not declared" warning
Convert remaining die*(BUG) messages
Replace all die("BUG: ...") calls by BUG() ones
run-command: use BUG() to report bugs, not die()
test-tool: help verifying BUG() code paths
This should make these functions easier to find and cache.h less
overwhelming to read.
In particular, this moves:
- read_object_file
- oid_object_info
- write_object_file
As a result, most of the codebase needs to #include object-store.h.
In this patch the #include is only added to files that would fail to
compile otherwise. It would be better to #include wherever
identifiers from the header are used. That can happen later
when we have better tooling for it.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In d8193743e0 (usage.c: add BUG() function, 2017-05-12), a new macro
was introduced to use for reporting bugs instead of die(). It was then
subsequently used to convert one single caller in 588a538ae5
(setup_git_env: convert die("BUG") to BUG(), 2017-05-12).
The cover letter of the patch series containing this patch
(cf 20170513032414.mfrwabt4hovujde2@sigill.intra.peff.net) is not
terribly clear why only one call site was converted, or what the plan
is for other, similar calls to die() to report bugs.
Let's just convert all remaining ones in one fell swoop.
This trick was performed by this invocation:
sed -i 's/die("BUG: /BUG("/g' $(git grep -l 'die("BUG' \*.c)
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a repository argument to allow the callers of oid_object_info
to be more specific about which repository to handle. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.
As with the previous commits, use a macro to catch callers passing a
repository other than the_repository at compile time.
Signed-off-by: Stefan Beller <sbeller@google.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert sha1_object_info and sha1_object_info_extended to take pointers
to struct object_id and rename them to use "oid" instead of "sha1" in
their names. Update the declaration and definition and apply the
following semantic patch, plus the standard object_id transforms:
@@
expression E1, E2;
@@
- sha1_object_info(E1.hash, E2)
+ oid_object_info(&E1, E2)
@@
expression E1, E2;
@@
- sha1_object_info(E1->hash, E2)
+ oid_object_info(E1, E2)
@@
expression E1, E2, E3;
@@
- sha1_object_info_extended(E1.hash, E2, E3)
+ oid_object_info_extended(&E1, E2, E3)
@@
expression E1, E2, E3;
@@
- sha1_object_info_extended(E1->hash, E2, E3)
+ oid_object_info_extended(E1, E2, E3)
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert this function to take a pointer to struct object_id and rename
it object_file_to_archive.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert the write_archive_entry_fn_t type to use a pointer to struct
object_id. Convert various static functions in the tar and zip
archivers also.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix configuration codepath to pay proper attention to commondir
that is used in multi-worktree situation, and isolate config API
into its own header file.
* bw/config-h:
config: don't implicitly use gitdir or commondir
config: respect commondir
setup: teach discover_git_directory to respect the commondir
config: don't include config.h by default
config: remove git_config_iter
config: create config.h
Stop including config.h by default in cache.h. Instead only include
config.h in those files which require use of the config system.
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit dddbad728c ("timestamp_t: a new data type for timestamps",
26-04-2017) introduced a new typedef 'timestamp_t', as a synonym for an
unsigned long, which was used at the time to represent timestamps in
git. A later commit 28f4aee3fb ("use uintmax_t for timestamps",
26-04-2017) changed the typedef to use an 'uintmax_t' for the timestamp
representation type.
When building on a 32-bit Linux system, sparse complains that a constant
(USTAR_MAX_MTIME) used to detect a 'far-future mtime' timestamp, is too
large; 'warning: constant 077777777777UL is so big it is unsigned long
long' on lines 335 and 338 of archive-tar.c. Note that both gcc and
clang only issue a warning if this constant is used in a context that
requires an 'unsigned long' (rather than an uintmax_t). (Since TIME_MAX
is no longer equal to 0xFFFFFFFF, even on a 32-bit system, the macro
USTAR_MAX_MTIME is set to 077777777777UL, which cannot be represented as
an 'unsigned long' constant).
In order to suppress the warning, change the definition of the macro
constant USTAR_MAX_MTIME to use an 'ULL' type suffix.
In a similar vein, on systems which use a 64-bit representation of the
'unsigned long' type, the USTAR_MAX_SIZE constant macro is defined with
the value 077777777777ULL. Although this does not cause any warning
messages to be issued, it would be more appropriate for this constant
to use an 'UL' type suffix rather than 'ULL'.
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git's source code assumes that unsigned long is at least as precise as
time_t. Which is incorrect, and causes a lot of problems, in particular
where unsigned long is only 32-bit (notably on Windows, even in 64-bit
versions).
So let's just use a more appropriate data type instead. In preparation
for this, we introduce the new `timestamp_t` data type.
By necessity, this is a very, very large patch, as it has to replace all
timestamps' data type in one go.
As we will use a data type that is not necessarily identical to `time_t`,
we need to be very careful to use `time_t` whenever we interact with the
system functions, and `timestamp_t` everywhere else.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function write_extended_header() only ever returns 0. Simplify
it and its caller by dropping its return value, like we did with
write_global_extended_header() earlier.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As we are not yet moving everything to size_t but still using ulong
internally when talking about the size of object, platforms with
32-bit long will not be able to produce tar archive with 4GB+ file,
and cannot grok 077777777777UL as a constant. Disable the extended
header feature and do not test it on them.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We never do any error checks, and so never return anything
but "0". Let's just drop this to simplify the code.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ustar format represents timestamps as seconds since the
epoch, but only has room to store 11 octal digits. To
express anything larger, we need to use an extended header.
This is exactly the same case we fixed for the size field in
the previous commit, and the solution here follows the same
pattern.
This is even mentioned as an issue in f2f0267 (archive-tar:
use xsnprintf for trivial formatting, 2015-09-24), but since
it only affected things far in the future, it wasn't deemed
worth dealing with. But note that my calculations claiming
thousands of years were off there; because our xsnprintf
produces a NUL byte, we only have until the year 2242 to fix
this.
Given that this is just around the corner (geologically
speaking, anyway), and because it's easy to fix, let's just
make it work. Unlike the previous fix for "size", where we
had to write an individual extended header for each file, we
can write one global header (since we have only one mtime
for the whole archive).
There's a slight bit of trickiness there. We may already be
writing a global header with a "comment" field for the
commit sha1. So we need to write our new field into the same
header. To do this, we push the decision of whether to write
such a header down into write_global_extended_header(),
which will now assemble the header as it sees fit, and will
return early if we have nothing to write (in practice, we'll
only have a large mtime if it comes from a commit, but this
makes it also work if you set your system clock ahead such
that time() returns a huge value).
Note that we don't (and never did) handle negative
timestamps (i.e., before 1970). This would probably not be
too hard to support in the same way, but since git does not
support negative timestamps at all, I didn't bother here.
After writing the extended header, we munge the timestamp in
the ustar headers to the maximum-allowable size. This is
wrong, but it's the least-wrong thing we can provide to a
tar implementation that doesn't understand pax headers (it's
also what GNU tar does).
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ustar format has a fixed-length field for the size of
each file entry which is supposed to contain up to 11 bytes
of octal-formatted data plus a NUL or space terminator.
These means that the largest size we can represent is
077777777777, or 1 byte short of 8GB. The correct solution
for a larger file, according to POSIX.1-2001, is to add an
extended pax header, similar to how we handle long
filenames. This patch does that, and writes zero for the
size field in the ustar header (the last bit is not
mentioned by POSIX, but it matches how GNU tar behaves with
--format=pax).
This should be a strict improvement over the current
behavior, which is to die in xsnprintf with a "BUG".
However, there's some interesting history here.
Prior to f2f0267 (archive-tar: use xsnprintf for trivial
formatting, 2015-09-24), we silently overflowed the "size"
field. The extra bytes ended up in the "mtime" field of the
header, which was then immediately written itself,
overwriting our extra bytes. What that means depends on how
many bytes we wrote.
If the size was 64GB or greater, then we actually overflowed
digits into the mtime field, meaning our value was
effectively right-shifted by those lost octal digits. And
this patch is again a strict improvement over that.
But if the size was between 8GB and 64GB, then our 12-byte
field held all of the actual digits, and only our NUL
terminator overflowed. According to POSIX, there should be a
NUL or space at the end of the field. However, GNU tar seems
to be lenient here, and will correctly parse a size up 64GB
(minus one) from the field. So sizes in this range might
have just worked, depending on the implementation reading
the tarfile.
This patch is mostly still an improvement there, as the 8GB
limit is specifically mentioned in POSIX as the correct
limit. But it's possible that it could be a regression
(versus the pre-f2f0267 state) if all of the following are
true:
1. You have a file between 8GB and 64GB.
2. Your tar implementation _doesn't_ know about pax
extended headers.
3. Your tar implementation _does_ parse 12-byte sizes from
the ustar header without a delimiter.
It's probably not worth worrying about such an obscure set
of conditions, but I'm documenting it here just in case.
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit f2f0267 (archive-tar: use xsnprintf for trivial
formatting, 2015-09-24) converted cases of "sprintf" to
"xsnprintf", but accidentally left one as just "snprintf".
This meant that we could silently truncate the resulting
buffer instead of flagging an error.
In practice, this is impossible to achieve, as we are
formatting a ustar checksum, which can be at most 7
characters. But the point of xsnprintf is to document and
check for "should be impossible" conditions; this site was
just accidentally mis-converted during f2f0267.
Noticed-by: Paul Green <Paul.Green@stratus.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we generate tar headers, we sprintf() values directly
into a struct with the fixed-size header values. For the
most part this is fine, as we are formatting small values
(e.g., the octal format of "mode & 0x7777" is of fixed
length). But it's still a good idea to use xsnprintf here.
It communicates to readers what our expectation is, and it
provides a run-time check that we are not overflowing the
buffers.
The one exception here is the mtime, which comes from the
epoch time of the commit we are archiving. For sane values,
this fits into the 12-byte value allocated in the header.
But since git can handle 64-bit times, if I claim to be a
visitor from the year 10,000 AD, I can overflow the buffer.
This turns out to be harmless, as we simply overflow into
the chksum field, which is then overwritten.
This case is also best as an xsnprintf. It should never come
up, short of extremely malformed dates, and in that case we
are probably better off dying than silently truncating the
date value (and we cannot expand the size of the buffer,
since it is dictated by the ustar format). Our friends in
the year 5138 (when we legitimately flip to a 12-digit
epoch) can deal with that problem then.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We sometimes sprintf into fixed-size buffers when we know
that the buffer is large enough to fit the input (either
because it's a constant, or because it's numeric input that
is bounded in size). Likewise with strcpy of constant
strings.
However, these sites make it hard to audit sprintf and
strcpy calls for buffer overflows, as a reader has to
cross-reference the size of the array with the input. Let's
use xsnprintf instead, which communicates to a reader that
we don't expect this to overflow (and catches the mistake in
case we do).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This looks like a simple omission from 8539070 (archive-tar:
unindent write_tar_entry by one level, 2012-05-03).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This reverts commit 10f343ea81, whose
output is no longer bit-for-bit equivalent from the older versions
of Git, which the infrastructure to (pretend to) upload tarballs
kernel.org uses depends on.
Most struct child_process variables are cleared using memset first after
declaration. Provide a macro, CHILD_PROCESS_INIT, that can be used to
initialize them statically instead. That's shorter, doesn't require a
function call and is slightly more readable (especially given that we
already have STRBUF_INIT, ARGV_ARRAY_INIT etc.).
Helped-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>