Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
gzip(1) encodes the OS it runs on in the 10th byte of its output. It
uses the following OS_CODE values according to its tailor.h [1]:
0 - MS-DOS
3 - UNIX
5 - Atari ST
6 - OS/2
10 - TOPS-20
11 - Windows NT
The gzip.exe that comes with Git for Windows uses OS_CODE 3 for some
reason, so this value is used on practically all supported platforms
when generating tgz archives using gzip(1).
Zlib uses a bigger set of values according to its zutil.h [2], aligned
with section 4.4.2 of the ZIP specification, APPNOTE.txt [3]:
0 - MS-DOS
1 - Amiga
3 - UNIX
4 - VM/CMS
5 - Atari ST
6 - OS/2
7 - Macintosh
8 - Z-System
10 - Windows NT
11 - MVS (OS/390 - Z/OS)
13 - Acorn Risc
16 - BeOS
18 - OS/400
19 - OS X (Darwin)
Thus the internal gzip implementation in archive-tar.c sets different
OS_CODE header values on major platforms Windows and macOS. Git for
Windows uses its own zlib-based variant since v2.20.1 by default and
thus embeds OS_CODE 10 in tgz archives.
The tar archive for a commit is generated consistently on all systems
(by the same Git version). The OS_CODE in the gzip header does not
influence extraction. Avoid leaking OS information and make tgz
archives constistent and reproducable (with the same Git and libz
versions) by using OS_CODE 3 everywhere.
At least on macOS 12.4 this produces the same output as gzip(1) for the
examples I tried:
# before
$ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
3abbffb40b7c63cf9b7d91afc682f11682f80759 -
# with this patch
$ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
dc6dc6ba9636d522799085d0d77ab6a110bcc141 -
$ git archive --format=tar v2.36.0 | gzip -cn | shasum
dc6dc6ba9636d522799085d0d77ab6a110bcc141 -
[1] https://git.savannah.gnu.org/cgit/gzip.git/tree/tailor.h
[2] https://github.com/madler/zlib/blob/master/zutil.h
[3] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git uses zlib for its own object store, but calls gzip when creating tgz
archives. Add an option to perform the gzip compression for the latter
using zlib, without depending on the external gzip binary.
Plug it in by making write_block a function pointer and switching to a
compressing variant if the filter command has the magic value "git
archive gzip". Does that indirection slow down tar creation? Not
really, at least not in this test:
$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'
Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
Time (mean ± σ): 4.044 s ± 0.007 s [User: 3.901 s, System: 0.137 s]
Range (min … max): 4.038 s … 4.059 s 10 runs
Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
Time (mean ± σ): 4.047 s ± 0.009 s [User: 3.903 s, System: 0.138 s]
Range (min … max): 4.038 s … 4.066 s 10 runs
How does tgz creation perform?
$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
Time (mean ± σ): 20.404 s ± 0.006 s [User: 23.943 s, System: 0.401 s]
Range (min … max): 20.395 s … 20.414 s 10 runs
Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
Time (mean ± σ): 23.807 s ± 0.023 s [User: 23.655 s, System: 0.145 s]
Range (min … max): 23.782 s … 23.857 s 10 runs
Summary
'./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
So the internal implementation takes 17% longer on the Linux repo, but
uses 2% less CPU time. That's because the external gzip can run in
parallel on its own processor, while the internal one works sequentially
and avoids the inter-process communication overhead.
What are the benefits? Only an internal sequential implementation can
offer this eco mode, and it allows avoiding the gzip(1) requirement.
This implementation uses the helper functions from our zlib.c instead of
the convenient gz* functions from zlib, because the latter doesn't give
the control over the generated gzip header that the next patch requires.
Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All tar archive writes have the same size and are done to the same file
descriptor. Move them to a common function, write_block(), to reduce
code duplication and make it easy to change the destination.
Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The void pointer "data" in struct archiver is only used to store filter
commands to pass tar archives to, like gzip. Rename it accordingly and
also turn it into a char pointer to document the fact that it's a string
reference.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Mention all formats in the --format section, use backtick quoting for
literal values throughout, clarify the description of the configuration
option.
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A new bug() and BUG_if_bug() API is introduced to make it easier to
uniformly log "detect multiple bugs and abort in the end" pattern.
* ab/bug-if-bug:
cache-tree.c: use bug() and BUG_if_bug()
receive-pack: use bug() and BUG_if_bug()
parse-options.c: use optbug() instead of BUG() "opts" check
parse-options.c: use new bug() API for optbug()
usage.c: add a non-fatal bug() function to go with BUG()
common-main.c: move non-trace2 exit() behavior out of trace2.c
More fsmonitor--daemon.
* jh/builtin-fsmonitor-part3: (30 commits)
t7527: improve implicit shutdown testing in fsmonitor--daemon
fsmonitor--daemon: allow --super-prefix argument
t7527: test Unicode NFC/NFD handling on MacOS
t/lib-unicode-nfc-nfd: helper prereqs for testing unicode nfc/nfd
t/helper/hexdump: add helper to print hexdump of stdin
fsmonitor: on macOS also emit NFC spelling for NFD pathname
t7527: test FSMonitor on case insensitive+preserving file system
fsmonitor: never set CE_FSMONITOR_VALID on submodules
t/perf/p7527: add perf test for builtin FSMonitor
t7527: FSMonitor tests for directory moves
fsmonitor: optimize processing of directory events
fsm-listen-darwin: shutdown daemon if worktree root is moved/renamed
fsm-health-win32: force shutdown daemon if worktree root moves
fsm-health-win32: add polling framework to monitor daemon health
fsmonitor--daemon: stub in health thread
fsmonitor--daemon: rename listener thread related variables
fsmonitor--daemon: prepare for adding health thread
fsmonitor--daemon: cd out of worktree root
fsm-listen-darwin: ignore FSEvents caused by xattr changes on macOS
unpack-trees: initialize fsmonitor_has_run_once in o->result
...
A misconfigured 'branch..remote' led to a bug in configuration
parsing.
* gc/zero-length-branch-config-fix:
remote.c: reject 0-length branch names
remote.c: don't BUG() on 0-length branch names
Rename .env_array member to .env in the child_process structure.
* ab/env-array:
run-command API users: use "env" not "env_array" in comments & names
run-command API: rename "env_array" to "env"
With a more targetted workaround in http.c in another topic, we may
be able to lift this blanket "GCC12 dangling-pointer warning is
broken and unsalvageable" workaround.
* cb/buggy-gcc-12-workaround:
Revert -Wno-error=dangling-pointer
"git clone --origin X" leaked piece of memory that held value read
from the clone.defaultRemoteName configuration variable, which has
been plugged.
source: <xmqqlevl4ysk.fsf@gitster.g>
* jc/clone-remote-name-leak-fix:
clone: plug a miniscule leak
The path taken by "git multi-pack-index" command from the end user
was compared with path internally prepared by the tool withut first
normalizing, which lead to duplicated paths not being noticed,
which has been corrected.
source: <pull.1221.v2.git.1650911234.gitgitgadget@gmail.com>
* ds/midx-normalize-pathname-before-comparison:
cache: use const char * for get_object_directory()
multi-pack-index: use --object-dir real path
midx: use real paths in lookup_multi_pack_index()
"git rebase --keep-base <upstream> <branch-to-rebase>" computed the
commit to rebase onto incorrectly, which has been corrected.
source: <20220421044233.894255-1-alexhenrie24@gmail.com>
* ah/rebase-keep-base-fix:
rebase: use correct base for --keep-base when a branch is given
Avoid problems from interaction between malloc_check and address
sanitizer.
source: <pull.1210.git.1649507317350.gitgitgadget@gmail.com>
* pw/test-malloc-with-sanitize-address:
tests: make SANITIZE=address imply TEST_NO_MALLOC_CHECK
The commit summary shown after making a commit is matched to what
is given in "git status" not to use the break-rewrite heuristics.
source: <c35bd0aa-2e46-e710-2b39-89f18bad0097@web.de>
* rs/commit-summary-wo-break-rewrite:
commit, sequencer: turn off break_opt for commit summary
macOS CI jobs have been occasionally flaky due to tentative version
skew between perforce and the homebrew packager. Instead of
failing the whole CI job, just let it skip the p4 tests when this
happens.
source: <20220512223940.238367-1-gitster@pobox.com>
* cb/ci-make-p4-optional:
ci: use https, not http to download binaries from perforce.com
ci: reintroduce prevention from perforce being quarantined in macOS
ci: avoid brew for installing perforce
ci: make failure to find perforce more user friendly
A bit of test framework fixes with a few fixes to issues found by
valgrind.
source: <20220512223218.237544-1-gitster@pobox.com>
* ab/valgrind-fixes:
commit-graph.c: don't assume that stat() succeeds
object-file: fix a unpack_loose_header() regression in 3b6a8db3b0
log test: skip a failing mkstemp() test under valgrind
tests: using custom GIT_EXEC_PATH breaks --valgrind tests
"git archive --add-file=<path>" picked up the raw permission bits
from the path and propagated to zip output in some cases, without
normalization, which has been corrected (tar output did not have
this issue).
source: <xmqqmtfme8v6.fsf@gitster.g>
* jc/archive-add-file-normalize-mode:
archive: do not let on-disk mode leak to zip archives
The "--current" option of "git show-branch" should have been made
incompatible with the "--reflog" mode, but this was not enforced,
which has been corrected.
source: <xmqqh76mf7s4.fsf_-_@gitster.g>
* jc/show-branch-g-current:
show-branch: -g and --current are incompatible
Meant to go with js/ci-gcc-12-fixes.
source: <xmqq7d68ytj8.fsf_-_@gitster.g>
* jc/http-clear-finished-pointer:
http.c: clear the 'finished' member once we are done with it
Fixes real problems noticed by gcc 12 and works around false
positives.
source: <pull.1238.git.1653351786.gitgitgadget@gmail.com>
* js/ci-gcc-12-fixes:
dir.c: avoid "exceeds maximum object size" error with GCC v12.x
nedmalloc: avoid new compile error
compat/win32/syslog: fix use-after-realloc
A git subcommand like "git add -p" spawns a separate git process
while relaying its command line arguments. A pathspec with only
negative elements was mistakenly passed with an empty string, which
has been corrected.
* jc/all-negative-pathspec:
pathspec: correct an empty string used as a pathspec element
Implementation of "scalar diagnose" subcommand.
* js/scalar-diagnose:
scalar: teach `diagnose` to gather loose objects information
scalar: teach `diagnose` to gather packfile info
scalar diagnose: include disk space information
scalar: implement `scalar diagnose`
scalar: validate the optional enlistment argument
archive --add-virtual-file: allow paths containing colons
archive: optionally add "virtual" files
The documentation on the interaction between "--add-file" and
"--prefix" options of "git archive" has been improved.
* rs/document-archive-prefix:
archive: improve documentation of --prefix
Leakfix.
* fh/transport-push-leakfix:
transport: free local and remote refs in transport_push()
transport: unify return values and exit point from transport_push()
transport: remove unnecessary indenting in transport_push()
Update the GitHub workflow support to make it quicker to get to the
failing test.
* js/ci-github-workflow-markup:
ci: call `finalize_test_case_output` a little later
ci(github): mention where the full logs can be found
ci: use `--github-workflow-markup` in the GitHub workflow
ci(github): avoid printing test case preamble twice
ci(github): skip the logs of the successful test cases
ci: optionally mark up output in the GitHub workflow
ci/run-build-and-tests: add some structure to the GitHub workflow output
ci: make it easier to find failed tests' logs in the GitHub workflow
ci/run-build-and-tests: take a more high-level view
test(junit): avoid line feeds in XML attributes
tests: refactor --write-junit-xml code
ci: fix code style
Plug the memory leaks from the trickiest API of all, the revision
walker.
* ab/plug-leak-in-revisions: (27 commits)
revisions API: add a TODO for diff_free(&revs->diffopt)
revisions API: have release_revisions() release "topo_walk_info"
revisions API: have release_revisions() release "date_mode"
revisions API: call diff_free(&revs->pruning) in revisions_release()
revisions API: release "reflog_info" in release revisions()
revisions API: clear "boundary_commits" in release_revisions()
revisions API: have release_revisions() release "prune_data"
revisions API: have release_revisions() release "grep_filter"
revisions API: have release_revisions() release "filter"
revisions API: have release_revisions() release "cmdline"
revisions API: have release_revisions() release "mailmap"
revisions API: have release_revisions() release "commits"
revisions API users: use release_revisions() for "prune_data" users
revisions API users: use release_revisions() with UNLEAK()
revisions API users: use release_revisions() in builtin/log.c
revisions API users: use release_revisions() in http-push.c
revisions API users: add "goto cleanup" for release_revisions()
stash: always have the owner of "stash_info" free it
revisions API users: use release_revisions() needing REV_INFO_INIT
revision.[ch]: document and move code declared around "init"
...
CMake updates.
* yw/cmake-updates:
cmake: remove (_)UNICODE def on Windows in CMakeLists.txt
cmake: add pcre2 support
cmake: fix CMakeLists.txt on Linux
A mechanism to pack unreachable objects into a "cruft pack",
instead of ejecting them into loose form to be reclaimed later, has
been introduced.
* tb/cruft-packs:
sha1-file.c: don't freshen cruft packs
builtin/gc.c: conditionally avoid pruning objects via loose
builtin/repack.c: add cruft packs to MIDX during geometric repack
builtin/repack.c: use named flags for existing_packs
builtin/repack.c: allow configuring cruft pack generation
builtin/repack.c: support generating a cruft pack
builtin/pack-objects.c: --cruft with expiration
reachable: report precise timestamps from objects in cruft packs
reachable: add options to add_unseen_recent_objects_to_traversal
builtin/pack-objects.c: --cruft without expiration
builtin/pack-objects.c: return from create_object_entry()
t/helper: add 'pack-mtimes' test-tool
pack-mtimes: support writing pack .mtimes files
chunk-format.h: extract oid_version()
pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
pack-mtimes: support reading .mtimes files
Documentation/technical: add cruft-packs.txt
Disable the "do not remove the directory the user started Git in"
logic when Git cannot tell where that directory is. Earlier we
refused to run in such a case.
* kl/setup-in-unreadable-worktree:
setup: don't die if realpath(3) fails on getcwd(3)
A workflow change for translators are being proposed.
* jx/l10n-workflow-change:
l10n: Document the new l10n workflow
Makefile: add "po-init" rule to initialize po/XX.po
Makefile: add "po-update" rule to update po/XX.po
po/git.pot: don't check in result of "make pot"
po/git.pot: this is now a generated file
Makefile: remove duplicate and unwanted files in FOUND_SOURCE_FILES
i18n CI: stop allowing non-ASCII source messages in po/git.pot
Makefile: have "make pot" not "reset --hard"
Makefile: generate "po/git.pot" from stable LOCALIZED_C
Makefile: sort source files before feeding to xgettext
Teach "git repack --geometric" work better with "--keep-pack" and
avoid corrupting the repository when packsize limit is used.
* tb/geom-repack-with-keep-and-max:
builtin/repack.c: ensure that `names` is sorted
t7703: demonstrate object corruption with pack.packSizeLimit
repack: respect --keep-pack with geometric repack
"sparse-checkout" learns to work well with the sparse-index
feature.
* ds/sparse-sparse-checkout:
sparse-checkout: integrate with sparse index
p2000: add test for 'git sparse-checkout [add|set]'
sparse-index: complete partial expansion
sparse-index: partially expand directories
sparse-checkout: --no-sparse-index needs a full index
cache-tree: implement cache_tree_find_path()
sparse-index: introduce partially-sparse indexes
sparse-index: create expand_index()
t1092: stress test 'git sparse-checkout set'
t1092: refactor 'sparse-index contents' test
The multi-pack-index code did not protect the packfile it is going
to depend on from getting removed while in use, which has been
corrected.
* tb/midx-race-in-pack-objects:
builtin/pack-objects.c: ensure pack validity from MIDX bitmap objects
builtin/pack-objects.c: ensure included `--stdin-packs` exist
builtin/pack-objects.c: avoid redundant NULL check
pack-bitmap.c: check preferred pack validity when opening MIDX bitmap
Preliminary code refactoring around transport and bundle code.
* ds/bundle-uri:
bundle.h: make "fd" version of read_bundle_header() public
remote: allow relative_url() to return an absolute url
remote: move relative_url()
http: make http_get_file() external
fetch-pack: move --keep=* option filling to a function
fetch-pack: add a deref_without_lazy_fetch_extended()
dir API: add a generalized path_match_flags() function
connect.c: refactor sending of agent & object-format
Introduce a filesystem-dependent mechanism to optimize the way the
bits for many loose object files are ensured to hit the disk
platter.
* ns/batch-fsync:
core.fsyncmethod: performance tests for batch mode
t/perf: add iteration setup mechanism to perf-lib
core.fsyncmethod: tests for batch mode
test-lib-functions: add parsing helpers for ls-files and ls-tree
core.fsync: use batch mode and sync loose objects by default on Windows
unpack-objects: use the bulk-checkin infrastructure
update-index: use the bulk-checkin infrastructure
builtin/add: add ODB transaction around add_files_to_cache
cache-tree: use ODB transaction around writing a tree
core.fsyncmethod: batched disk flushes for loose-objects
bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'
bulk-checkin: rename 'state' variable and separate 'plugged' boolean
Deprecate non-cone mode of the sparse-checkout feature.
* en/sparse-cone-becomes-default:
Documentation: some sparsity wording clarifications
git-sparse-checkout.txt: mark non-cone mode as deprecated
git-sparse-checkout.txt: flesh out pattern set sections a bit
git-sparse-checkout.txt: add a new EXAMPLES section
git-sparse-checkout.txt: shuffle some sections and mark as internal
git-sparse-checkout.txt: update docs for deprecation of 'init'
git-sparse-checkout.txt: wording updates for the cone mode default
sparse-checkout: make --cone the default
tests: stop assuming --no-cone is the default mode for sparse-checkout
Follow-up on a preceding commit which changed all references to the
"env_array" when referring to the "struct child_process" member. These
changes are all unnecessary for the compiler, but help the code's
human readers.
All the comments that referred to "env_array" have now been updated,
as well as function names and variables that had "env_array" in their
name, they now refer to "env".
In addition the "out" name for the submodule.h prototype was
inconsistent with the function definition's use of "env_array" in
submodule.c. Both of them use "env" now.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Start following-up on the rename mentioned in c7c4bdeccf (run-command
API: remove "env" member, always use "env_array", 2021-11-25) of
"env_array" to "env".
The "env_array" name was picked in 19a583dc39 (run-command: add
env_array, an optional argv_array for env, 2014-10-19) because "env"
was taken. Let's not forever keep the oddity of "*_array" for this
"struct strvec", but not for its "args" sibling.
This commit is almost entirely made with a coccinelle rule[1]. The
only manual change here is in run-command.h to rename the struct
member itself and to change "env_array" to "env" in the
CHILD_PROCESS_INIT initializer.
The rest of this is all a result of applying [1]:
* make contrib/coccinelle/run_command.cocci.patch
* patch -p1 <contrib/coccinelle/run_command.cocci.patch
* git add -u
1. cat contrib/coccinelle/run_command.pending.cocci
@@
struct child_process E;
@@
- E.env_array
+ E.env
@@
struct child_process *E;
@@
- E->env_array
+ E->env
I've avoided changing any comments and derived variable names here,
that will all be done in the next commit.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>