839a66349e
Edit and expand the sparse-index design document with the plan for guarding index operations with ensure_full_index(). Notably, the plan has changed to not have an expand_to_path() method in favor of checking for a sparse-directory hit inside of the index_path_pos() API. The changes that follow this one will incrementally add ensure_full_index() guards to iterations over all cache entries. Some iterations over the cache entries are not protected due to a few categories listed in the document. Since these are not being modified, here is a short list of the files and methods that will not receive these guards: Looking for non-zero stage: * builtin/add.c:chmod_pathspec() * builtin/merge.c:count_unmerged_entries() * merge-ort.c:record_conflicted_index_entries() * read-cache.c:unmerged_index() * rerere.c:check_one_conflict(), find_conflict(), rerere_remaining() * revision.c:prepare_show_merge() * sequencer.c:append_conflicts_hint() * wt-status.c:wt_status_collect_changes_initial() Looking for submodules: * builtin/submodule--helper.c:module_list_compute() * submodule.c: several methods * worktree.c:validate_no_submodules() Part of the index API: * name-hash.c: lazy init methods * preload-index.c:preload_thread(), preload_index() * read-cache.c: file format methods Checking for correct order of cache entries: * read-cache.c:check_ce_order() Ignores SKIP_WORKTREE entries or already aware: * unpack-trees.c:mark_new_skip_worktree() * wt-status.c:wt_status_check_sparse_checkout() Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
209 lines
9.3 KiB
Plaintext
209 lines
9.3 KiB
Plaintext
Git Sparse-Index Design Document
|
|
================================
|
|
|
|
The sparse-checkout feature allows users to focus a working directory on
|
|
a subset of the files at HEAD. The cone mode patterns, enabled by
|
|
`core.sparseCheckoutCone`, allow for very fast pattern matching to
|
|
discover which files at HEAD belong in the sparse-checkout cone.
|
|
|
|
Three important scale dimensions for a Git working directory are:
|
|
|
|
* `HEAD`: How many files are present at `HEAD`?
|
|
|
|
* Populated: How many files are within the sparse-checkout cone.
|
|
|
|
* Modified: How many files has the user modified in the working directory?
|
|
|
|
We will use big-O notation -- O(X) -- to denote how expensive certain
|
|
operations are in terms of these dimensions.
|
|
|
|
These dimensions are ordered by their magnitude: users (typically) modify
|
|
fewer files than are populated, and we can only populate files at `HEAD`.
|
|
|
|
Problems occur if there is an extreme imbalance in these dimensions. For
|
|
example, if `HEAD` contains millions of paths but the populated set has
|
|
only tens of thousands, then commands like `git status` and `git add` can
|
|
be dominated by operations that require O(`HEAD`) operations instead of
|
|
O(Populated). Primarily, the cost is in parsing and rewriting the index,
|
|
which is filled primarily with files at `HEAD` that are marked with the
|
|
`SKIP_WORKTREE` bit.
|
|
|
|
The sparse-index intends to take these commands that read and modify the
|
|
index from O(`HEAD`) to O(Populated). To do this, we need to modify the
|
|
index format in a significant way: add "sparse directory" entries.
|
|
|
|
With cone mode patterns, it is possible to detect when an entire
|
|
directory will have its contents outside of the sparse-checkout definition.
|
|
Instead of listing all of the files it contains as individual entries, a
|
|
sparse-index contains an entry with the directory name, referencing the
|
|
object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
|
|
If we need to discover the details for paths within that directory, we
|
|
can parse trees to find that list.
|
|
|
|
At time of writing, sparse-directory entries violate expectations about the
|
|
index format and its in-memory data structure. There are many consumers in
|
|
the codebase that expect to iterate through all of the index entries and
|
|
see only files. In fact, these loops expect to see a reference to every
|
|
staged file. One way to handle this is to parse trees to replace a
|
|
sparse-directory entry with all of the files within that tree as the index
|
|
is loaded. However, parsing trees is slower than parsing the index format,
|
|
so that is a slower operation than if we left the index alone. The plan is
|
|
to make all of these integrations "sparse aware" so this expansion through
|
|
tree parsing is unnecessary and they use fewer resources than when using a
|
|
full index.
|
|
|
|
The implementation plan below follows four phases to slowly integrate with
|
|
the sparse-index. The intention is to incrementally update Git commands to
|
|
interact safely with the sparse-index without significant slowdowns. This
|
|
may not always be possible, but the hope is that the primary commands that
|
|
users need in their daily work are dramatically improved.
|
|
|
|
Phase I: Format and initial speedups
|
|
------------------------------------
|
|
|
|
During this phase, Git learns to enable the sparse-index and safely parse
|
|
one. Protections are put in place so that every consumer of the in-memory
|
|
data structure can operate with its current assumption of every file at
|
|
`HEAD`.
|
|
|
|
At first, every index parse will call a helper method,
|
|
`ensure_full_index()`, which scans the index for sparse-directory entries
|
|
(pointing to trees) and replaces them with the full list of paths (with
|
|
blob contents) by parsing tree objects. This will be slower in all cases.
|
|
The only noticeable change in behavior will be that the serialized index
|
|
file contains sparse-directory entries.
|
|
|
|
To start, we use a new required index extension, `sdir`, to allow
|
|
inserting sparse-directory entries into indexes with file format
|
|
versions 2, 3, and 4. This prevents Git versions that do not understand
|
|
the sparse-index from operating on one, while allowing tools that do not
|
|
understand the sparse-index to operate on repositories as long as they do
|
|
not interact with the index. A new format, index v5, will be introduced
|
|
that includes sparse-directory entries by default. It might also
|
|
introduce other features that have been considered for improving the
|
|
index, as well.
|
|
|
|
Next, consumers of the index will be guarded against operating on a
|
|
sparse-index by inserting calls to `ensure_full_index()` or
|
|
`expand_index_to_path()`. If a specific path is requested, then those will
|
|
be protected from within the `index_file_exists()` and `index_name_pos()`
|
|
API calls: they will call `ensure_full_index()` if necessary. The
|
|
intention here is to preserve existing behavior when interacting with a
|
|
sparse-checkout. We don't want a change to happen by accident, without
|
|
tests. Many of these locations may not need any change before removing the
|
|
guards, but we should not do so without tests to ensure the expected
|
|
behavior happens.
|
|
|
|
It may be desirable to _change_ the behavior of some commands in the
|
|
presence of a sparse index or more generally in any sparse-checkout
|
|
scenario. In such cases, these should be carefully communicated and
|
|
tested. No such behavior changes are intended during this phase.
|
|
|
|
During a scan of the codebase, not every iteration of the cache entries
|
|
needs an `ensure_full_index()` check. The basic reasons include:
|
|
|
|
1. The loop is scanning for entries with non-zero stage. These entries
|
|
are not collapsed into a sparse-directory entry.
|
|
|
|
2. The loop is scanning for submodules. These entries are not collapsed
|
|
into a sparse-directory entry.
|
|
|
|
3. The loop is part of the index API, especially around reading or
|
|
writing the format.
|
|
|
|
4. The loop is checking for correct order of cache entries and that is
|
|
correct if and only if the sparse-directory entries are in the correct
|
|
location.
|
|
|
|
5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
|
|
otherwise already aware of sparse directory entries.
|
|
|
|
6. The sparse-index is disabled at this point when using the split-index
|
|
feature, so no effort is made to protect the split-index API.
|
|
|
|
Even after inserting these guards, we will keep expanding sparse-indexes
|
|
for most Git commands using the `command_requires_full_index` repository
|
|
setting. This setting will be on by default and disabled one builtin at a
|
|
time until we have sufficient confidence that all of the index operations
|
|
are properly guarded.
|
|
|
|
To complete this phase, the commands `git status` and `git add` will be
|
|
integrated with the sparse-index so that they operate with O(Populated)
|
|
performance. They will be carefully tested for operations within and
|
|
outside the sparse-checkout definition.
|
|
|
|
Phase II: Careful integrations
|
|
------------------------------
|
|
|
|
This phase focuses on ensuring that all index extensions and APIs work
|
|
well with a sparse-index. This requires significant increases to our test
|
|
coverage, especially for operations that interact with the working
|
|
directory outside of the sparse-checkout definition. Some of these
|
|
behaviors may not be the desirable ones, such as some tests already
|
|
marked for failure in `t1092-sparse-checkout-compatibility.sh`.
|
|
|
|
The index extensions that may require special integrations are:
|
|
|
|
* FS Monitor
|
|
* Untracked cache
|
|
|
|
While integrating with these features, we should look for patterns that
|
|
might lead to better APIs for interacting with the index. Coalescing
|
|
common usage patterns into an API call can reduce the number of places
|
|
where sparse-directories need to be handled carefully.
|
|
|
|
Phase III: Important command speedups
|
|
-------------------------------------
|
|
|
|
At this point, the patterns for testing and implementing sparse-directory
|
|
logic should be relatively stable. This phase focuses on updating some of
|
|
the most common builtins that use the index to operate as O(Populated).
|
|
Here is a potential list of commands that could be valuable to integrate
|
|
at this point:
|
|
|
|
* `git commit`
|
|
* `git checkout`
|
|
* `git merge`
|
|
* `git rebase`
|
|
|
|
Hopefully, commands such as `git merge` and `git rebase` can benefit
|
|
instead from merge algorithms that do not use the index as a data
|
|
structure, such as the merge-ORT strategy. As these topics mature, we
|
|
may enable the ORT strategy by default for repositories using the
|
|
sparse-index feature.
|
|
|
|
Along with `git status` and `git add`, these commands cover the majority
|
|
of users' interactions with the working directory. In addition, we can
|
|
integrate with these commands:
|
|
|
|
* `git grep`
|
|
* `git rm`
|
|
|
|
These have been proposed as some whose behavior could change when in a
|
|
repo with a sparse-checkout definition. It would be good to include this
|
|
behavior automatically when using a sparse-index. Some clarity is needed
|
|
to make the behavior switch clear to the user.
|
|
|
|
This phase is the first where parallel work might be possible without too
|
|
much conflicts between topics.
|
|
|
|
Phase IV: The long tail
|
|
-----------------------
|
|
|
|
This last phase is less a "phase" and more "the new normal" after all of
|
|
the previous work.
|
|
|
|
To start, the `command_requires_full_index` option could be removed in
|
|
favor of expanding only when hitting an API guard.
|
|
|
|
There are many Git commands that could use special attention to operate as
|
|
O(Populated), while some might be so rare that it is acceptable to leave
|
|
them with additional overhead when a sparse-index is present.
|
|
|
|
Here are some commands that might be useful to update:
|
|
|
|
* `git sparse-checkout set`
|
|
* `git am`
|
|
* `git clean`
|
|
* `git stash`
|