git-commit-vandalism/Documentation/git-maintenance.txt

110 lines
4.1 KiB
Plaintext
Raw Normal View History

maintenance: create basic maintenance runner The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as repacking the repository, packing refs, and rewriting the commit-graph file. The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database. Create a new 'maintenance' builtin that will become a more general- purpose command. To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background. For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin. In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin. The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin. Use existing builtin/gc.c file because we want to share code between the two builtins. It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'. Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file. A negation mode is available that will be used in later tests. Helped-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 20:11:42 +02:00
git-maintenance(1)
==================
NAME
----
git-maintenance - Run tasks to optimize Git repository data
SYNOPSIS
--------
[verse]
'git maintenance' run [<options>]
DESCRIPTION
-----------
Run tasks to optimize Git repository data, speeding up other Git commands
and reducing storage requirements for the repository.
Git commands that add repository data, such as `git add` or `git fetch`,
are optimized for a responsive user experience. These commands do not take
time to optimize the Git data, since such optimizations scale with the full
size of the repository while these user commands each perform a relatively
small action.
The `git maintenance` command provides flexibility for how to optimize the
Git repository.
SUBCOMMANDS
-----------
run::
Run one or more maintenance tasks. If one or more `--task` options
are specified, then those tasks are run in that order. Otherwise,
the tasks are determined by which `maintenance.<task>.enabled`
config options are true. By default, only `maintenance.gc.enabled`
is true.
maintenance: create basic maintenance runner The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as repacking the repository, packing refs, and rewriting the commit-graph file. The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database. Create a new 'maintenance' builtin that will become a more general- purpose command. To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background. For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin. In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin. The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin. Use existing builtin/gc.c file because we want to share code between the two builtins. It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'. Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file. A negation mode is available that will be used in later tests. Helped-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 20:11:42 +02:00
TASKS
-----
commit-graph::
The `commit-graph` job updates the `commit-graph` files incrementally,
then verifies that the written data is correct. The incremental
write is safe to run alongside concurrent Git processes since it
will not expire `.graph` files that were in the previous
`commit-graph-chain` file. They will be deleted by a later run based
on the expiration delay.
maintenance: add prefetch task When working with very large repositories, an incremental 'git fetch' command can download a large amount of data. If there are many other users pushing to a common repo, then this data can rival the initial pack-file size of a 'git clone' of a medium-size repo. Users may want to keep the data on their local repos as close as possible to the data on the remote repos by fetching periodically in the background. This can break up a large daily fetch into several smaller hourly fetches. The task is called "prefetch" because it is work done in advance of a foreground fetch to make that 'git fetch' command much faster. However, if we simply ran 'git fetch <remote>' in the background, then the user running a foreground 'git fetch <remote>' would lose some important feedback when a new branch appears or an existing branch updates. This is especially true if a remote branch is force-updated and this isn't noticed by the user because it occurred in the background. Further, the functionality of 'git push --force-with-lease' becomes suspect. When running 'git fetch <remote> <options>' in the background, use the following options for careful updating: 1. --no-tags prevents getting a new tag when a user wants to see the new tags appear in their foreground fetches. 2. --refmap= removes the configured refspec which usually updates refs/remotes/<remote>/* with the refs advertised by the remote. While this looks confusing, this was documented and tested by b40a50264ac (fetch: document and test --refmap="", 2020-01-21), including this sentence in the documentation: Providing an empty `<refspec>` to the `--refmap` option causes Git to ignore the configured refspecs and rely entirely on the refspecs supplied as command-line arguments. 3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*" we can ensure that we actually load the new values somewhere in our refspace while not updating refs/heads or refs/remotes. By storing these refs here, the commit-graph job will update the commit-graph with the commits from these hidden refs. 4. --prune will delete the refs/prefetch/<remote> refs that no longer appear on the remote. 5. --no-write-fetch-head prevents updating FETCH_HEAD. We've been using this step as a critical background job in Scalar [1] (and VFS for Git). This solved a pain point that was showing up in user reports: fetching was a pain! Users do not like waiting to download the data that was created while they were away from their machines. After implementing background fetch, the foreground fetch commands sped up significantly because they mostly just update refs and download a small amount of new data. The effect is especially dramatic when paried with --no-show-forced-udpates (through fetch.showForcedUpdates=false). [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 14:33:31 +02:00
prefetch::
The `prefetch` task updates the object directory with the latest
objects from all registered remotes. For each remote, a `git fetch`
command is run. The refmap is custom to avoid updating local or remote
branches (those in `refs/heads` or `refs/remotes`). Instead, the
remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
not updated.
+
This is done to avoid disrupting the remote-tracking branches. The end users
expect these refs to stay unmoved unless they initiate a fetch. With prefetch
task, however, the objects necessary to complete a later real fetch would
already be obtained, so the real fetch would go faster. In the ideal case,
it will just become an update to bunch of remote-tracking branches without
any object transfer.
maintenance: create basic maintenance runner The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as repacking the repository, packing refs, and rewriting the commit-graph file. The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database. Create a new 'maintenance' builtin that will become a more general- purpose command. To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background. For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin. In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin. The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin. Use existing builtin/gc.c file because we want to share code between the two builtins. It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'. Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file. A negation mode is available that will be used in later tests. Helped-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 20:11:42 +02:00
gc::
Clean up unnecessary files and optimize the local repository. "GC"
stands for "garbage collection," but this task performs many
smaller tasks. This task can be expensive for large repositories,
as it repacks all Git objects into a single pack-file. It can also
be disruptive in some situations, as it deletes stale data. See
linkgit:git-gc[1] for more details on garbage collection in Git.
maintenance: add loose-objects task One goal of background maintenance jobs is to allow a user to disable auto-gc (gc.auto=0) but keep their repository in a clean state. Without any cleanup, loose objects will clutter the object database and slow operations. In addition, the loose objects will take up extra space because they are not stored with deltas against similar objects. Create a 'loose-objects' task for the 'git maintenance run' command. This helps clean up loose objects without disrupting concurrent Git commands using the following sequence of events: 1. Run 'git prune-packed' to delete any loose objects that exist in a pack-file. Concurrent commands will prefer the packed version of the object to the loose version. (Of course, there are exceptions for commands that specifically care about the location of an object. These are rare for a user to run on purpose, and we hope a user that has selected background maintenance will not be trying to do foreground maintenance.) 2. Run 'git pack-objects' on a batch of loose objects. These objects are grouped by scanning the loose object directories in lexicographic order until listing all loose objects -or- reaching 50,000 objects. This is more than enough if the loose objects are created only by a user doing normal development. We noticed users with _millions_ of loose objects because VFS for Git downloads blobs on-demand when a file read operation requires populating a virtual file. This step is based on a similar step in Scalar [1] and VFS for Git. [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 14:33:32 +02:00
loose-objects::
The `loose-objects` job cleans up loose objects and places them into
pack-files. In order to prevent race conditions with concurrent Git
commands, it follows a two-step process. First, it deletes any loose
objects that already exist in a pack-file; concurrent Git processes
will examine the pack-file for the object data instead of the loose
object. Second, it creates a new pack-file (starting with "loose-")
containing a batch of loose objects. The batch size is limited to 50
thousand objects to prevent the job from taking too long on a
repository with many loose objects. The `gc` task writes unreachable
objects as loose objects to be cleaned up by a later step only if
they are not re-added to a pack-file; for this reason it is not
advisable to enable both the `loose-objects` and `gc` tasks at the
same time.
maintenance: create basic maintenance runner The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as repacking the repository, packing refs, and rewriting the commit-graph file. The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database. Create a new 'maintenance' builtin that will become a more general- purpose command. To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background. For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin. In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin. The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin. Use existing builtin/gc.c file because we want to share code between the two builtins. It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'. Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file. A negation mode is available that will be used in later tests. Helped-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 20:11:42 +02:00
OPTIONS
-------
--auto::
When combined with the `run` subcommand, run maintenance tasks
only if certain thresholds are met. For example, the `gc` task
runs when the number of loose objects exceeds the number stored
in the `gc.auto` config setting, or when the number of pack-files
exceeds the `gc.autoPackLimit` config setting.
--quiet::
Do not report progress or other information over `stderr`.
--task=<task>::
If this option is specified one or more times, then only run the
specified tasks in the specified order. If no `--task=<task>`
arguments are specified, then only the tasks with
`maintenance.<task>.enabled` configured as `true` are considered.
See the 'TASKS' section for the list of accepted `<task>` values.
maintenance: create basic maintenance runner The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as repacking the repository, packing refs, and rewriting the commit-graph file. The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database. Create a new 'maintenance' builtin that will become a more general- purpose command. To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background. For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin. In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin. The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin. Use existing builtin/gc.c file because we want to share code between the two builtins. It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'. Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file. A negation mode is available that will be used in later tests. Helped-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 20:11:42 +02:00
GIT
---
Part of the linkgit:git[1] suite