Commit Graph

12895 Commits

Author SHA1 Message Date
Derrick Stolee
3797a0a7b7 maintenance: use Windows scheduled tasks
Git's background maintenance uses cron by default, but this is not
available on Windows. Instead, integrate with Task Scheduler.

Tasks can be scheduled using the 'schtasks' command. There are several
command-line options that can allow for some advanced scheduling, but
unfortunately these seem to all require authenticating using a password.

Instead, use the "/xml" option to pass an XML file that contains the
configuration for the necessary schedule. These XML files are based on
some that I exported after constructing a schedule in the Task Scheduler
GUI. These options only run background maintenance when the user is
logged in, and more fields are populated with the current username and
SID at run-time by 'schtasks'.

Since the GIT_TEST_MAINT_SCHEDULER environment variable allows us to
specify 'schtasks' as the scheduler, we can test the Windows-specific
logic on other platforms. Thus, add a check that the XML file written
by Git is valid when xmllint exists on the system.

Since we use a temporary file for the XML files sent to 'schtasks', we
prefix the random characters with the frequency so it is easier to
examine the proper file during tests. Instead of an exact match on the
'args' file, we 'grep' for the arguments other than the filename.

There is a deficiency in the current design. Windows has two kinds of
applications: GUI applications that start by "winmain()" and console
applications that start by "main()". Console applications are attached
to a new Console window if they are not already associated with a GUI
application. This means that every hour the scheudled task launches a
command window for the scheduled tasks. Not only is this visually
obtrusive, but it also takes focus from whatever else the user is
doing!

A simple fix would be to insert a GUI application that acts as a shim
between the scheduled task and Git. This is currently possible in Git
for Windows by setting the <Command> tag equal to

  C:\Program Files\Git\git-bash.exe

with options "--hide --no-needs-console --command=cmd\git.exe"
followed by the arguments currently used. Since git-bash.exe is not
included in Windows builds of core Git, I chose to leave out this
feature. My plan is to submit a small patch to Git for Windows that
converts the use of git.exe with this use of git-bash.exe in the
short term. In the long term, we can consider creating this GUI
shim application within core Git, perhaps in contrib/.

Co-authored-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-05 14:38:02 -08:00
Derrick Stolee
2afe7e3567 maintenance: use launchctl on macOS
The existing mechanism for scheduling background maintenance is done
through cron. The 'crontab -e' command allows updating the schedule
while cron itself runs those commands. While this is technically
supported by macOS, it has some significant deficiencies:

1. Every run of 'crontab -e' must request elevated privileges through
   the user interface. When running 'git maintenance start' from the
   Terminal app, it presents a dialog box saying "Terminal.app would
   like to administer your computer. Administration can include
   modifying passwords, networking, and system settings." This is more
   alarming than what we are hoping to achieve. If this alert had some
   information about how "git" is trying to run "crontab" then we would
   have some reason to believe that this dialog might be fine. However,
   it also doesn't help that some scenarios just leave Git waiting for
   a response without presenting anything to the user. I experienced
   this when executing the command from a Bash terminal view inside
   Visual Studio Code.

2. While cron initializes a user environment enough for "git config
   --global --show-origin" to show the correct config file information,
   it does not set up the environment enough for Git Credential Manager
   Core to load credentials during a 'prefetch' task. My prefetches
   against private repositories required re-authenticating through UI
   pop-ups in a way that should not be required.

The solution is to switch from cron to the Apple-recommended [1]
'launchd' tool.

[1] https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/ScheduledJobs.html

The basics of this tool is that we need to create XML-formatted
"plist" files inside "~/Library/LaunchAgents/" and then use the
'launchctl' tool to make launchd aware of them. The plist files
include all of the scheduling information, along with the command-line
arguments split across an array of <string> tags.

For example, here is my plist file for the weekly scheduled tasks:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>Label</key><string>org.git-scm.git.weekly</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/libexec/git-core/git</string>
<string>--exec-path=/usr/local/libexec/git-core</string>
<string>for-each-repo</string>
<string>--config=maintenance.repo</string>
<string>maintenance</string>
<string>run</string>
<string>--schedule=weekly</string>
</array>
<key>StartCalendarInterval</key>
<array>
<dict>
<key>Day</key><integer>0</integer>
<key>Hour</key><integer>0</integer>
<key>Minute</key><integer>0</integer>
</dict>
</array>
</dict>
</plist>

The schedules for the daily and hourly tasks are more complicated
since we need to use an array for the StartCalendarInterval with
an entry for each of the six days other than the 0th day (to avoid
colliding with the weekly task), and each of the 23 hours other
than the 0th hour (to avoid colliding with the daily task).

The "Label" value is currently filled with "org.git-scm.git.X"
where X is the frequency. We need a different plist file for each
frequency.

The launchctl command needs to be aligned with a user id in order
to initialize the command environment. This must be done using
the 'launchctl bootstrap' subcommand. This subcommand is new as
of macOS 10.11, which was released in September 2015. Before that
release the 'launchctl load' subcommand was recommended. The best
source of information on this transition I have seen is available
at [2]. The current design does not preclude a future version that
detects the available fatures of 'launchctl' to use the older
commands. However, it is best to rely on the newest version since
Apple might completely remove the deprecated version on short
notice.

[2] https://babodee.wordpress.com/2016/04/09/launchctl-2-0-syntax/

To remove a schedule, we must run 'launchctl bootout' with a valid
plist file. We also need to 'bootout' a task before the 'bootstrap'
subcommand will succeed, if such a task already exists.

The need for a user id requires us to run 'id -u' which works on
POSIX systems but not Windows. Further, the need for fully-qualitifed
path names including $HOME behaves differently in the Git internals and
the external test suite. The $HOME variable starts with "C:\..." instead
of the "/c/..." that is provided by Git in these subcommands. The test
therefore has a prerequisite that we are not on Windows. The cross-
platform logic still allows us to test the macOS logic on a Linux
machine.

We can verify the commands that were run by 'git maintenance start'
and 'git maintenance stop' by injecting a script that writes the
command-line arguments into GIT_TEST_MAINT_SCHEDULER.

An earlier version of this patch accidentally had an opening
"<dict>" tag when it should have had a closing "</dict>" tag. This
was caught during manual testing with actual 'launchctl' commands,
but we do not want to update developers' tasks when running tests.
It appears that macOS includes the "xmllint" tool which can verify
the XML format. This is useful for any system that might contain
the tool, so use it whenever it is available.

We strive to make these tests work on all platforms, but Windows caused
some headaches. In particular, the value of getuid() called by the C
code is not guaranteed to be the same as `$(id -u)` invoked by a test.
This is because `git.exe` is a native Windows program, whereas the
utility programs run by the test script mostly utilize the MSYS2 runtime,
which emulates a POSIX-like environment. Since the purpose of the test
is to check that the input to the hook is well-formed, the actual user
ID is immaterial, thus we can work around the problem by making the the
test UID-agnostic. Another subtle issue is the $HOME environment
variable being a Windows-style path instead of a Unix-style path. We can
be more flexible here instead of expecting exact path matches.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Co-authored-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-05 14:38:02 -08:00
Derrick Stolee
16c5690929 maintenance: include 'cron' details in docs
Advanced and expert users may want to know how 'git maintenance start'
schedules background maintenance in order to customize their own
schedules beyond what the maintenance.* config values allow. Start a new
set of sections in git-maintenance.txt that describe how 'cron' is used
to run these tasks.

This is particularly valuable for users who want to inspect what Git is
doing or for users who want to customize the schedule further. Having a
baseline can provide a way forward for users who have never worked with
cron schedules.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-24 13:02:29 -08:00
Derrick Stolee
0016b61818 maintenance: add troubleshooting guide to docs
The 'git maintenance run' subcommand takes a lock on the object database
to prevent concurrent processes from competing for resources. This is an
important safety measure to prevent possible repository corruption and
data loss.

This feature can lead to confusing behavior if a user is not aware of
it. Add a TROUBLESHOOTING section to the 'git maintenance' builtin
documentation that discusses these tradeoffs. The short version of this
section is that Git will not corrupt your repository, but if the list of
scheduled tasks takes longer than an hour then some scheduled tasks may
be dropped due to this object database collision. For example, a
long-running "daily" task at midnight might prevent an "hourly" task
from running at 1AM.

The opposite is also possible, but less likely as long as the "hourly"
tasks are much faster than the "daily" and "weekly" tasks.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-16 08:36:42 -07:00
Derrick Stolee
61f7a383d3 maintenance: use 'incremental' strategy by default
The 'git maintenance (register|start)' subcommands add the current
repository to the global Git config so maintenance will operate on that
repository. It does not specify what maintenance should occur or how
often.

To make it simple for users to start background maintenance with a
recommended schedlue, update the 'maintenance.strategy' config option in
both the 'register' and 'start' subcommands. This allows users to
customize beyond the defaults using individual
'maintenance.<task>.schedule' options, but also the user can opt-out of
this strategy using 'maintenance.strategy=none'.

Helped-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-16 08:36:42 -07:00
Derrick Stolee
a4cb1a2339 maintenance: create maintenance.strategy config
To provide an on-ramp for users to use background maintenance without
several 'git config' commands, create a 'maintenance.strategy' config
option. Currently, the only important value is 'incremental' which
assigns the following schedule:

* gc: never
* prefetch: hourly
* commit-graph: hourly
* loose-objects: daily
* incremental-repack: daily

These tasks are chosen to minimize disruptions to foreground Git
commands and use few compute resources.

The 'maintenance.strategy' is intended as a baseline that can be
customzied further by manually assigning 'maintenance.<task>.enabled'
and 'maintenance.<task>.schedule' config options, which will override
any recommendation from 'maintenance.strategy'. This operates similarly
to config options like 'feature.experimental' which operate as "meta"
config options that change default config values.

This presents a way forward for updating the 'incremental' strategy in
the future or adding new strategies. For example, a potential strategy
could be to include a 'full' strategy that runs the 'gc' task weekly
and no other tasks by default.

Helped-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-16 08:36:42 -07:00
Derrick Stolee
2fec604f8d maintenance: add start/stop subcommands
Add new subcommands to 'git maintenance' that start or stop background
maintenance using 'cron', when available. This integration is as simple
as I could make it, barring some implementation complications.

The schedule is laid out as follows:

  0 1-23 * * *   $cmd maintenance run --schedule=hourly
  0 0    * * 1-6 $cmd maintenance run --schedule=daily
  0 0    * * 0   $cmd maintenance run --schedule=weekly

where $cmd is a properly-qualified 'git for-each-repo' execution:

$cmd=$path/git --exec-path=$path for-each-repo --config=maintenance.repo

where $path points to the location of the Git executable running 'git
maintenance start'. This is critical for systems with multiple versions
of Git. Specifically, macOS has a system version at '/usr/bin/git' while
the version that users can install resides at '/usr/local/bin/git'
(symlinked to '/usr/local/libexec/git-core/git'). This will also use
your locally-built version if you build and run this in your development
environment without installing first.

This conditional schedule avoids having cron launch multiple 'git
for-each-repo' commands in parallel. Such parallel commands would likely
lead to the 'hourly' and 'daily' tasks competing over the object
database lock. This could lead to to some tasks never being run! Since
the --schedule=<frequency> argument will run all tasks with _at least_
the given frequency, the daily runs will also run the hourly tasks.
Similarly, the weekly runs will also run the daily and hourly tasks.

The GIT_TEST_CRONTAB environment variable is not intended for users to
edit, but instead as a way to mock the 'crontab [-l]' command. This
variable is set in test-lib.sh to avoid a future test from accidentally
running anything with the cron integration from modifying the user's
schedule. We use GIT_TEST_CRONTAB='test-tool crontab <file>' in our
tests to check how the schedule is modified in 'git maintenance
(start|stop)' commands.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:59:44 -07:00
Derrick Stolee
0c18b70081 maintenance: add [un]register subcommands
In preparation for launching background maintenance from the 'git
maintenance' builtin, create register/unregister subcommands. These
commands update the new 'maintenance.repos' config option in the global
config so the background maintenance job knows which repositories to
maintain.

These commands allow users to add a repository to the background
maintenance list without disrupting the actual maintenance mechanism.

For example, a user can run 'git maintenance register' when no
background maintenance is running and it will not start the background
maintenance. A later update to start running background maintenance will
then pick up this repository automatically.

The opposite example is that a user can run 'git maintenance unregister'
to remove the current repository from background maintenance without
halting maintenance for other repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:59:44 -07:00
Derrick Stolee
4950b2a2b5 for-each-repo: run subcommands on configured repos
It can be helpful to store a list of repositories in global or system
config and then iterate Git commands on that list. Create a new builtin
that makes this process simple for experts. We will use this builtin to
run scheduled maintenance on all configured repositories in a future
change.

The test is very simple, but does highlight that the "--" argument is
optional.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:59:44 -07:00
Derrick Stolee
b08ff1fee0 maintenance: add --schedule option and config
Maintenance currently triggers when certain data-size thresholds are
met, such as number of pack-files or loose objects. Users may want to
run certain maintenance tasks based on frequency instead. For example,
a user may want to perform a 'prefetch' task every hour, or 'gc' task
every day. To help these users, update the 'git maintenance run' command
to include a '--schedule=<frequency>' option. The allowed frequencies
are 'hourly', 'daily', and 'weekly'. These values are also allowed in a
new config value 'maintenance.<task>.schedule'.

The 'git maintenance run --schedule=<frequency>' checks the '*.schedule'
config value for each enabled task to see if the configured frequency is
at least as frequent as the frequency from the '--schedule' argument. We
use the following order, for full clarity:

	'hourly' > 'daily' > 'weekly'

Use new 'enum schedule_priority' to track these values numerically.

The following cron table would run the scheduled tasks with the correct
frequencies:

  0 1-23 * * *    git -C <repo> maintenance run --schedule=hourly
  0 0    * * 1-6  git -C <repo> maintenance run --schedule=daily
  0 0    * * 0    git -C <repo> maintenance run --schedule=weekly

This cron schedule will run --schedule=hourly every hour except at
midnight. This avoids a concurrent run with the --schedule=daily that
runs at midnight every day except the first day of the week. This avoids
a concurrent run with the --schedule=weekly that runs at midnight on
the first day of the week. Since --schedule=daily also runs the
'hourly' tasks and --schedule=weekly runs the 'hourly' and 'daily'
tasks, we will still see all tasks run with the proper frequencies.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:59:44 -07:00
Derrick Stolee
1942d48380 maintenance: optionally skip --auto process
Some commands run 'git maintenance run --auto --[no-]quiet' after doing
their normal work, as a way to keep repositories clean as they are used.
Currently, users who do not want this maintenance to occur would set the
'gc.auto' config option to 0 to avoid the 'gc' task from running.
However, this does not stop the extra process invocation. On Windows,
this extra process invocation can be more expensive than necessary.

Allow users to drop this extra process by setting 'maintenance.auto' to
'false'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:59:44 -07:00
Derrick Stolee
e841a79a13 maintenance: add incremental-repack auto condition
The incremental-repack task updates the multi-pack-index by deleting pack-
files that have been replaced with new packs, then repacking a batch of
small pack-files into a larger pack-file. This incremental repack is faster
than rewriting all object data, but is slower than some other
maintenance activities.

The 'maintenance.incremental-repack.auto' config option specifies how many
pack-files should exist outside of the multi-pack-index before running
the step. These pack-files could be created by 'git fetch' commands or
by the loose-objects task. The default value is 10.

Setting the option to zero disables the task with the '--auto' option,
and a negative value makes the task run every time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:05 -07:00
Derrick Stolee
52fe41ff1c maintenance: add incremental-repack task
The previous change cleaned up loose objects using the
'loose-objects' that can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'incremental-repack' task runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updaes the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those objects. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximates how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the incremental-repack task will delete these
   repacked pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Derrick Stolee
18e449f86b midx: enable core.multiPackIndex by default
The core.multiPackIndex setting has been around since c4d25228eb
(config: create core.multiPackIndex setting, 2018-07-12), but has been
disabled by default. If a user wishes to use the multi-pack-index
feature, then they must enable this config and run 'git multi-pack-index
write'.

The multi-pack-index feature is relatively stable now, so make the
config option true by default. For users that do not use a
multi-pack-index, the only extra cost will be a file lookup to see if a
multi-pack-index file exists (once per process, per object directory).

Also, this config option will be referenced by an upcoming
"incremental-repack" task in the maintenance builtin, so move the config
option into the repository settings struct. Note that if
GIT_TEST_MULTI_PACK_INDEX=1, then we want to ignore the config option
and treat core.multiPackIndex as enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Derrick Stolee
3e220e6069 maintenance: create auto condition for loose-objects
The loose-objects task deletes loose objects that already exist in a
pack-file, then place the remaining loose objects into a new pack-file.
If this step runs all the time, then we risk creating pack-files with
very few objects with every 'git commit' process. To prevent
overwhelming the packs directory with small pack-files, place a minimum
number of objects to justify the task.

The 'maintenance.loose-objects.auto' config option specifies a minimum
number of loose objects to justify the task to run under the '--auto'
option. This defaults to 100 loose objects. Setting the value to zero
will prevent the step from running under '--auto' while a negative value
will force it to run every time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Derrick Stolee
252cfb7cb8 maintenance: add loose-objects task
One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file.

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Derrick Stolee
28cb5e66dd maintenance: add prefetch task
When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

The task is called "prefetch" because it is work done in advance
of a foreground fetch to make that 'git fetch' command much faster.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.
   While this looks confusing, this was documented and tested by
   b40a50264a (fetch: document and test --refmap="", 2020-01-21),
   including this sentence in the documentation:

	Providing an empty `<refspec>` to the `--refmap` option
	causes Git to ignore the configured refspecs and rely
	entirely on the refspecs supplied as command-line arguments.

3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/prefetch/<remote> refs that no
   longer appear on the remote.

5. --no-write-fetch-head prevents updating FETCH_HEAD.

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paried with --no-show-forced-udpates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Derrick Stolee
4ddc79b2da maintenance: add auto condition for commit-graph task
Instead of writing a new commit-graph in every 'git maintenance run
--auto' process (when maintenance.commit-graph.enalbed is configured to
be true), only write when there are "enough" commits not in a
commit-graph file.

This count is controlled by the maintenance.commit-graph.auto config
option.

To compute the count, use a depth-first search starting at each ref, and
leaving markers using the SEEN flag. If this count reaches the limit,
then terminate early and start the task. Otherwise, this operation will
peel every ref and parse the commit it points to. If these are all in
the commit-graph, then this is typically a very fast operation. Users
with many refs might feel a slow-down, and hence could consider updating
their limit to be very small. A negative value will force the step to
run every time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Derrick Stolee
65d655b52d maintenance: create maintenance.<task>.enabled config
Currently, a normal run of "git maintenance run" will only run the 'gc'
task, as it is the only one enabled. This is mostly for backwards-
compatible reasons since "git maintenance run --auto" commands replaced
previous "git gc --auto" commands after some Git processes. Users could
manually run specific maintenance tasks by calling "git maintenance run
--task=<task>" directly.

Allow users to customize which steps are run automatically using config.
The 'maintenance.<task>.enabled' option then can turn on these other
tasks (or turn off the 'gc' task).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Derrick Stolee
090511bc0b maintenance: add --task option
A user may want to only run certain maintenance tasks in a certain
order. Add the --task=<task> option, which allows a user to specify an
ordered list of tasks to run. These cannot be run multiple times,
however.

Here is where our array of maintenance_task pointers becomes critical.
We can sort the array of pointers based on the task order, but we do not
want to move the struct data itself in order to preserve the hashmap
references. We use the hashmap to match the --task=<task> arguments into
the task struct data.

Keep in mind that the 'enabled' member of the maintenance_task struct is
a placeholder for a future 'maintenance.<task>.enabled' config option.
Thus, we use the 'enabled' member to specify which tasks are run when
the user does not specify any --task=<task> arguments. The 'enabled'
member should be ignored if --task=<task> appears.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Derrick Stolee
663b2b1b90 maintenance: add commit-graph task
The first new task in the 'git maintenance' builtin is the
'commit-graph' task. This updates the commit-graph file
incrementally with the command

	git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes most
of the time, with the longer writes following a power law
distribution.

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will verly likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Derrick Stolee
a95ce12430 maintenance: replace run_auto_gc()
The run_auto_gc() method is used in several places to trigger a check
for repo maintenance after some Git commands, such as 'git commit' or
'git fetch'.

To allow for extra customization of this maintenance activity, replace
the 'git gc --auto [--quiet]' call with one to 'git maintenance run
--auto [--quiet]'. As we extend the maintenance builtin with other
steps, users will be able to select different maintenance activities.

Rename run_auto_gc() to run_auto_maintenance() to be clearer what is
happening on this call, and to expose all callers in the current diff.
Rewrite the method to use a struct child_process to simplify the calls
slightly.

Since 'git fetch' already allows disabling the 'git gc --auto'
subprocess, add an equivalent option with a different name to be more
descriptive of the new behavior: '--[no-]maintenance'. Update the
documentation to include these options at the same time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Derrick Stolee
3ddaad0e06 maintenance: add --quiet option
Maintenance activities are commonly used as steps in larger scripts.
Providing a '--quiet' option allows those scripts to be less noisy when
run on a terminal window. Turn this mode on by default when stderr is
not a terminal.

Pipe the option to the 'git gc' child process.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Derrick Stolee
2057d75038 maintenance: create basic maintenance runner
The 'gc' builtin is our current entrypoint for automatically maintaining
a repository. This one tool does many operations, such as repacking the
repository, packing refs, and rewriting the commit-graph file. The name
implies it performs "garbage collection" which means several different
things, and some users may not want to use this operation that rewrites
the entire object database.

Create a new 'maintenance' builtin that will become a more general-
purpose command. To start, it will only support the 'run' subcommand,
but will later expand to add subcommands for scheduling maintenance in
the background.

For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin.
In fact, the only option is the '--auto' toggle, which is handed
directly to the 'gc' builtin. The current change is isolated to this
simple operation to prevent more interesting logic from being lost in
all of the boilerplate of adding a new builtin.

Use existing builtin/gc.c file because we want to share code between the
two builtins. It is possible that we will have 'maintenance' replace the
'gc' builtin entirely at some point, leaving 'git gc' as an alias for
some specific arguments to 'git maintenance run'.

Create a new test_subcommand helper that allows us to test if a certain
subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a
file. A negation mode is available that will be used in later tests.

Helped-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:04 -07:00
Junio C Hamano
887952b8c6 fetch: optionally allow disabling FETCH_HEAD update
If you run fetch but record the result in remote-tracking branches,
and either if you do nothing with the fetched refs (e.g. you are
merely mirroring) or if you always work from the remote-tracking
refs (e.g. you fetch and then merge origin/branchname separately),
you can get away with having no FETCH_HEAD at all.

Teach "git fetch" a command line option "--[no-]write-fetch-head".
The default is to write FETCH_HEAD, and the option is primarily
meant to be used with the "--no-" prefix to override this default,
because there is no matching fetch.writeFetchHEAD configuration
variable to flip the default to off (in which case, the positive
form may become necessary to defeat it).

Note that under "--dry-run" mode, FETCH_HEAD is never written;
otherwise you'd see list of objects in the file that you do not
actually have.  Passing `--write-fetch-head` does not force `git
fetch` to write the file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-18 12:56:57 -07:00
Junio C Hamano
2befe97201 Eighth batch
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17 17:02:50 -07:00
Junio C Hamano
a555b514cd Merge branch 'so/log-diff-merges-opt'
Earlier, to countermand the implicit "-m" option when the
"--first-parent" option is used with "git log", we added the
"--[no-]diff-merges" option in the jk/log-fp-implies-m topic.  To
leave the door open to allow the "--diff-merges" option to take
values that instructs how patches for merge commits should be
computed (e.g. "cc"? "-p against first parent?"), redefine
"--diff-merges" to take non-optional value, and implement "off"
that means the same thing as "--no-diff-merges".

* so/log-diff-merges-opt:
  t/t4013: add test for --diff-merges=off
  doc/git-log: describe --diff-merges=off
  revision: change "--diff-merges" option to require parameter
2020-08-17 17:02:50 -07:00
Junio C Hamano
eca8c62a50 Merge branch 'jk/log-fp-implies-m'
"git log --first-parent -p" showed patches only for single-parent
commits on the first-parent chain; the "--first-parent" option has
been made to imply "-m".  Use "--no-diff-merges" to restore the
previous behaviour to omit patches for merge commits.

* jk/log-fp-implies-m:
  doc/git-log: clarify handling of merge commit diffs
  doc/git-log: move "-t" into diff-options list
  doc/git-log: drop "-r" diff option
  doc/git-log: move "Diff Formatting" from rev-list-options
  log: enable "-m" automatically with "--first-parent"
  revision: add "--no-diff-merges" option to counteract "-m"
  log: drop "--cc implies -m" logic
2020-08-17 17:02:49 -07:00
Junio C Hamano
47f0f94bc7 Merge branch 'al/bisect-first-parent'
"git bisect" learns the "--first-parent" option to find the first
breakage along the first-parent chain.

* al/bisect-first-parent:
  bisect: combine args passed to find_bisection()
  bisect: introduce first-parent flag
  cmd_bisect__helper: defer parsing no-checkout flag
  rev-list: allow bisect and first-parent flags
  t6030: modernize "git bisect run" tests
2020-08-17 17:02:45 -07:00
Junio C Hamano
95c687bf85 Merge branch 'hn/reftable-prep-part-2'
Further preliminary change to refs API.

* hn/reftable-prep-part-2:
  Make HEAD a PSEUDOREF rather than PER_WORKTREE.
  Modify pseudo refs through ref backend storage
  t1400: use git rev-parse for testing PSEUDOREF existence
2020-08-17 17:02:42 -07:00
Junio C Hamano
a00bda2b2f Merge branch 'dd/send-email-config'
Stop when "sendmail.*" configuration variables are defined, which
could be a mistaken attempt to define "sendemail.*" variables.

* dd/send-email-config:
  git-send-email: die if sendmail.* config is set
2020-08-17 17:02:41 -07:00
Junio C Hamano
878e727637 Seventh batch
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-13 14:13:59 -07:00
Junio C Hamano
5707ac426d Merge branch 'rp/blame-first-parent-doc'
The "git blame --first-parent" option was not documented, but now
it is.

* rp/blame-first-parent-doc:
  blame-options.txt: document --first-parent option
2020-08-13 14:13:40 -07:00
Junio C Hamano
d1a8a8979d Merge branch 'jt/has_object'
A new helper function has_object() has been introduced to make it
easier to mark object existence checks that do and don't want to
trigger lazy fetches, and a few such checks are converted using it.

* jt/has_object:
  fsck: do not lazy fetch known non-promisor object
  pack-objects: no fetch when allow-{any,promisor}
  apply: do not lazy fetch when applying binary
  sha1-file: introduce no-lazy-fetch has_object()
2020-08-13 14:13:39 -07:00
Junio C Hamano
7814e8a05a Sixth batch
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-11 18:04:13 -07:00
Junio C Hamano
73a9255166 Merge branch 'tb/upload-pack-filters'
The component to respond to "git fetch" request is made more
configurable to selectively allow or reject object filtering
specification used for partial cloning.

* tb/upload-pack-filters:
  t5616: use test_i18ngrep for upload-pack errors
  upload-pack.c: introduce 'uploadpackfilter.tree.maxDepth'
  upload-pack.c: allow banning certain object filter(s)
  list_objects_filter_options: introduce 'list_object_filter_config_name'
2020-08-11 18:04:13 -07:00
Junio C Hamano
a3afa4becd Merge branch 'es/worktree-doc-cleanups'
Doc cleanup around "worktree".

* es/worktree-doc-cleanups:
  git-worktree.txt: link to man pages when citing other Git commands
  git-worktree.txt: make start of new sentence more obvious
  git-worktree.txt: fix minor grammatical issues
  git-worktree.txt: consistently use term "working tree"
  git-worktree.txt: employ fixed-width typeface consistently
2020-08-11 18:04:12 -07:00
Junio C Hamano
e0ad9574dd Merge branch 'bc/sha-256-part-3'
The final leg of SHA-256 transition.

* bc/sha-256-part-3: (39 commits)
  t: remove test_oid_init in tests
  docs: add documentation for extensions.objectFormat
  ci: run tests with SHA-256
  t: make SHA1 prerequisite depend on default hash
  t: allow testing different hash algorithms via environment
  t: add test_oid option to select hash algorithm
  repository: enable SHA-256 support by default
  setup: add support for reading extensions.objectformat
  bundle: add new version for use with SHA-256
  builtin/verify-pack: implement an --object-format option
  http-fetch: set up git directory before parsing pack hashes
  t0410: mark test with SHA1 prerequisite
  t5308: make test work with SHA-256
  t9700: make hash size independent
  t9500: ensure that algorithm info is preserved in config
  t9350: make hash size independent
  t9301: make hash size independent
  t9300: use $ZERO_OID instead of hard-coded object ID
  t9300: abstract away SHA-1-specific constants
  t8011: make hash size independent
  ...
2020-08-11 18:04:11 -07:00
Sergey Organov
405a2fdf99 doc/git-log: describe --diff-merges=off
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-11 14:20:27 -07:00
Junio C Hamano
4f0a8be784 Fifth batch
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-10 10:24:04 -07:00
Junio C Hamano
995c71986a Merge branch 'pb/guide-docs'
Update "git help guides" documentation organization.

* pb/guide-docs:
  git.txt: add list of guides
  Documentation: don't hardcode command categories twice
  help: drop usage of 'common' and 'useful' for guides
  command-list.txt: add missing 'gitcredentials' and 'gitremote-helpers'
2020-08-10 10:24:04 -07:00
Junio C Hamano
d3e54edb93 Merge branch 'ny/notes-doc-sample-update'
Doc updates.

* ny/notes-doc-sample-update:
  docs: improve the example that illustrates git-notes path names
2020-08-10 10:24:02 -07:00
Junio C Hamano
46b225f153 Merge branch 'jk/strvec'
The argv_array API is useful for not just managing argv but any
"vector" (NULL-terminated array) of strings, and has seen adoption
to a certain degree.  It has been renamed to "strvec" to reduce the
barrier to adoption.

* jk/strvec:
  strvec: rename struct fields
  strvec: drop argv_array compatibility layer
  strvec: update documention to avoid argv_array
  strvec: fix indentation in renamed calls
  strvec: convert remaining callers away from argv_array name
  strvec: convert more callers away from argv_array name
  strvec: convert builtin/ callers away from argv_array name
  quote: rename sq_dequote_to_argv_array to mention strvec
  strvec: rename files from argv-array to strvec
  argv-array: rename to strvec
  argv-array: use size_t for count and alloc
2020-08-10 10:23:57 -07:00
Aaron Lipman
e8861ffc20 bisect: introduce first-parent flag
Upon seeing a merge commit when bisecting, this option may be used to
follow only the first parent.

In detecting regressions introduced through the merging of a branch, the
merge commit will be identified as introduction of the bug and its
ancestors will be ignored.

This option is particularly useful in avoiding false positives when a
merged branch contained broken or non-buildable commits, but the merge
itself was OK.

Signed-off-by: Aaron Lipman <alipman88@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-07 15:13:03 -07:00
Aaron Lipman
0fe305a5d3 rev-list: allow bisect and first-parent flags
Add first_parent_only parameter to find_bisection(), removing the
barrier that prevented combining the --bisect and --first-parent flags
when using git rev-list

Based-on-patch-by: Tiago Botelho <tiagonbotelho@hotmail.com>
Signed-off-by: Aaron Lipman <alipman88@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-07 15:11:59 -07:00
Raymond E. Pasco
11bc12ae1e blame-options.txt: document --first-parent option
blame/annotate have supported --first-parent since commit 95a4fb0eac
("blame: handle --first-parent"). This adds a blurb on that option to
the documentation.

Signed-off-by: Raymond E. Pasco <ray@ameretat.dev>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-06 14:08:10 -07:00
Jonathan Tan
ee47243d76 pack-objects: no fetch when allow-{any,promisor}
The options --missing=allow-{any,promisor} were introduced in caf3827e2f
("rev-list: add list-objects filtering support", 2017-11-22) with the
following note in the commit message:

    This patch introduces handling of missing objects to help
    debugging and development of the "partial clone" mechanism,
    and once the mechanism is implemented, for a power user to
    perform operations that are missing-object aware without
    incurring the cost of checking if a missing link is expected.

The idea that these options are missing-object aware (and thus do not
need to lazily fetch objects, unlike unaware commands that assume that
all objects are present) are assumed in later commits such as 07ef3c6604
("fetch test: use more robust test for filtered objects", 2020-01-15).

However, the current implementations of these options use
has_object_file(), which indeed lazily fetches missing objects. Teach
these implementations not to do so. Also, update the documentation of
these options to be clearer.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-06 13:01:03 -07:00
Philippe Blain
f442f28a81 git.txt: add list of guides
Not all man5/man7 guides are mentioned in the 'git(1)' documentation,
which makes the missing ones somewhat hard to find.

Add a list of the guides to git(1) by leveraging the existing
`Documentation/cmd-list.perl` script to generate a file `cmds-guide.txt`
which gets included in git.txt.

Also, do not hard-code the manual section '1'. Instead, use a regex so
that the manual section is discovered from the first line of each
`git*.txt` file.

This addition was hinted at in 1b81d8cb19 (help: use command-list.txt
for the source of guides, 2018-05-20).

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-04 18:34:02 -07:00
Junio C Hamano
e7a9807a62 Documentation: don't hardcode command categories twice
Instead of hard-coding the list of command categories in both
`Documentation/Makefile` and `Documentation/cmd-list.perl`, make the
Makefile the authoritative source and tweak `cmd-list.perl` so that it
receives the list of command categories as argument.

Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-04 18:34:02 -07:00
Philippe Blain
0371a764d2 help: drop usage of 'common' and 'useful' for guides
Since 1b81d8cb19 (help: use command-list.txt for the source of guides,
2018-05-20), all man5/man7 guides listed in command-list.txt appear in
the output of 'git help -g'.

However, 'git help -g' still prefixes this list with "The common Git
guides are:", which makes one wonder if there are others!

In the same spirit, the man page for 'git help' describes the '--guides'
option as listing 'useful' guides, which is not false per se but can
also be taken to mean that there are other guides that exist but are not
useful.

Instead of 'common' and 'useful', use 'Git concept guides' in both
places. To keep the code in line with this change, rename
help.c::list_common_guides_help to list_guides_help.

Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-04 18:34:01 -07:00