52fe41ff1c
The previous change cleaned up loose objects using the 'loose-objects'
task that can be run safely in the background. Add a similar job that
performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to repack
all pack-files into a single pack-file. While this is the most
space-efficient way to store object data, it is not time or memory
efficient. This becomes extremely important if the repo is so large
that a user struggles to store two copies of the pack on their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates this
process ever since 'git multi-pack-index expire' was added in 19575c7
(multi-pack-index: implement 'expire' subcommand, 2019-06-10) and
'git multi-pack-index repack' was added in ce1e4a1 (midx: implement
midx_repack(), 2019-06-10).

The 'incremental-repack' task runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced", which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updates the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to use
   background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed by
   the multi-pack-index to be in those pack-files. The set of
   pack-files is selected greedily by sorting the pack-files by
   modified time and adding a pack-file to the set if its "expected
   size" is smaller than the batch size, until the total expected
   size of the selected pack-files is at least the batch size. The
   "expected size" is calculated by taking the size of the pack-file,
   dividing it by the number of objects in the pack-file, and
   multiplying by the number of objects from the multi-pack-index
   with offsets in that pack-file. The expected size approximates how
   much data from that pack-file will contribute to the resulting
   pack-file size. The intention is that the resulting pack-file will
   be close in size to the provided batch size.

   The next run of the incremental-repack task will delete these
   repacked pack-files during the 'expire' step.

   In this version, the batch size is set to "0", which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but it
   requires doing some calculations that are better isolated to a
   separate change.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for users
of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug in
the multi-pack-index fixed by af96fe3 (midx: add packs to packed_git
linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed
pack-files containing commits and trees by timestamp. These
pack-files were grouped into "daily" pack-files once a day for up to
30 days. If a user did not request prefetch packs for over 30 days,
then they would get the entire history of commits and trees in a new,
large pack-file. This led to a large number of pack-files that had
poor delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of
pack-files spanning 30-50 GB. This was done all without removing
objects from the system and using a constant batch size of two
gigabytes. Once the work was done to reduce the pack-files to small
sizes, the batch size of two gigabytes means that not every run
triggers a repack operation, so the following run will not expire a
pack-file. This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
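
For reference, a rough manual equivalent of the three steps above,
using only the subcommands named in this message (the task drives the
same machinery internally rather than shelling out exactly like this):

  $ git multi-pack-index write
  $ git multi-pack-index expire
  $ git multi-pack-index repack --batch-size=0

As a worked example of the "expected size" heuristic: a 100 MB
pack-file holding 1,000 objects, of which the multi-pack-index still
references 250, has an expected size of (100 MB / 1,000) * 250 = 25 MB.
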
git-maintenance(1)
==================

NAME
----
git-maintenance - Run tasks to optimize Git repository data


SYNOPSIS
--------
[verse]
'git maintenance' run [<options>]


DESCRIPTION
-----------
Run tasks to optimize Git repository data, speeding up other Git commands
and reducing storage requirements for the repository.

Git commands that add repository data, such as `git add` or `git fetch`,
are optimized for a responsive user experience. These commands do not take
time to optimize the Git data, since such optimizations scale with the full
size of the repository while these user commands each perform a relatively
small action.

The `git maintenance` command provides flexibility for how to optimize the
Git repository.

SUBCOMMANDS
-----------

run::
Run one or more maintenance tasks. If one or more `--task` options
are specified, then those tasks are run in that order. Otherwise,
the tasks are determined by which `maintenance.<task>.enabled`
config options are true. By default, only `maintenance.gc.enabled`
is true.
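+
For example, one way to enable an additional task before the next run is to
set its config option (the `commit-graph` task here is only an illustration;
any task name from the 'TASKS' section works the same way):
+
------------
$ git config maintenance.commit-graph.enabled true
$ git maintenance run
------------
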
TASKS
-----

commit-graph::
The `commit-graph` job updates the `commit-graph` files incrementally,
then verifies that the written data is correct. The incremental
write is safe to run alongside concurrent Git processes since it
will not expire `.graph` files that were in the previous
`commit-graph-chain` file. They will be deleted by a later run based
on the expiration delay.
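+
As a rough illustration, the incremental write and the verification are
similar in spirit to running the commands below by hand; the exact options
used internally are not guaranteed to match:
+
------------
$ git commit-graph write --reachable --split
$ git commit-graph verify --shallow
------------
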
prefetch::
The `prefetch` task updates the object directory with the latest
objects from all registered remotes. For each remote, a `git fetch`
command is run. The refmap is custom to avoid updating local or remote
branches (those in `refs/heads` or `refs/remotes`). Instead, the
remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
not updated.
+
This is done to avoid disrupting the remote-tracking branches. The end users
expect these refs to stay unmoved unless they initiate a fetch. With the
prefetch task, however, the objects necessary to complete a later real fetch
would already be obtained, so the real fetch would go faster. In the ideal
case, it will just become an update to a bunch of remote-tracking branches
without any object transfer.
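+
A close approximation of the fetch run for each remote is shown below;
`<remote>` stands for the remote's name, the empty `--refmap=` keeps the
configured remote-tracking refspecs from being applied, and the exact flags
may differ between Git versions:
+
------------
$ git fetch <remote> --prune --no-tags --refmap= \
	"+refs/heads/*:refs/prefetch/<remote>/*"
------------
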
gc::
Clean up unnecessary files and optimize the local repository. "GC"
stands for "garbage collection," but this task performs many
smaller tasks. This task can be expensive for large repositories,
as it repacks all Git objects into a single pack-file. It can also
be disruptive in some situations, as it deletes stale data. See
linkgit:git-gc[1] for more details on garbage collection in Git.

loose-objects::
The `loose-objects` job cleans up loose objects and places them into
pack-files. In order to prevent race conditions with concurrent Git
commands, it follows a two-step process. First, it deletes any loose
objects that already exist in a pack-file; concurrent Git processes
will examine the pack-file for the object data instead of the loose
object. Second, it creates a new pack-file (starting with "loose-")
containing a batch of loose objects. The batch size is limited to 50
thousand objects to prevent the job from taking too long on a
repository with many loose objects. The `gc` task writes unreachable
objects as loose objects to be cleaned up by a later step only if
they are not re-added to a pack-file; for this reason it is not
advisable to enable both the `loose-objects` and `gc` tasks at the
same time.
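+
Conceptually, the two steps behave like the hand-rolled commands below. The
real task limits each batch to 50 thousand objects and does not shell out
like this, so treat it only as a sketch:
+
------------
# step 1: drop loose objects that already exist in a pack-file
$ git prune-packed
# step 2: pack the remaining loose objects (object IDs on stdin)
$ find .git/objects/?? -type f |
	sed -e 's|.*objects/\(..\)/\(.*\)|\1\2|' |
	git pack-objects .git/objects/pack/loose
------------
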
incremental-repack::
The `incremental-repack` job repacks the object directory
using the `multi-pack-index` feature. In order to prevent race
conditions with concurrent Git commands, it follows a two-step
process. First, it calls `git multi-pack-index expire` to delete
pack-files unreferenced by the `multi-pack-index` file. Second, it
calls `git multi-pack-index repack` to select several small
pack-files and repack them into a bigger one, and then update the
`multi-pack-index` entries that refer to the small pack-files to
refer to the new pack-file. This prepares those small pack-files
for deletion upon the next run of `git multi-pack-index expire`.
The selection of the small pack-files is such that the expected
size of the big pack-file is at least the batch size; see the
`--batch-size` option for the `repack` subcommand in
linkgit:git-multi-pack-index[1]. The default batch-size is zero,
which is a special case that attempts to repack all pack-files
into a single pack-file.
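+
To experiment with the underlying commands by hand using a non-zero batch
size (two gigabytes below is only an example value), something like the
following can be run; the subsequent `expire` removes the pack-files that
were just consolidated:
+
------------
$ git multi-pack-index repack --batch-size=2g
$ git multi-pack-index expire
------------
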
OPTIONS
-------
--auto::
When combined with the `run` subcommand, run maintenance tasks
only if certain thresholds are met. For example, the `gc` task
runs when the number of loose objects exceeds the number stored
in the `gc.auto` config setting, or when the number of pack-files
exceeds the `gc.autoPackLimit` config setting.
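+
A minimal sketch of threshold-driven maintenance (the value 200 is arbitrary;
with it, the `gc` task runs only once roughly that many loose objects have
accumulated):
+
------------
$ git config gc.auto 200
$ git maintenance run --auto
------------
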
--quiet::
Do not report progress or other information over `stderr`.

--task=<task>::
If this option is specified one or more times, then only run the
specified tasks in the specified order. If no `--task=<task>`
arguments are specified, then only the tasks with
`maintenance.<task>.enabled` configured as `true` are considered.
See the 'TASKS' section for the list of accepted `<task>` values.
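+
For instance, to run only the `prefetch` and `commit-graph` tasks, in that
order, regardless of the enabled config options:
+
------------
$ git maintenance run --task=prefetch --task=commit-graph
------------
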
GIT
---
Part of the linkgit:git[1] suite