Our order for processing of entries means that if we have a tree of
files that looks like
Makefile
src/moduleA/foo.c
src/moduleA/bar.c
src/moduleB/baz.c
src/moduleB/umm.c
tokens.txt
Then we will process paths in the order of the leftmost column below. I
have added two additional columns that help explain the algorithm that
follows; the 2nd column is there to remind us we have oid & mode info we
are tracking for each of these paths (which differs between the paths
which I'm not representing well here), and the third column annotates
the parent directory of the entry:
tokens.txt <version_info> ""
src/moduleB/umm.c <version_info> src/moduleB
src/moduleB/baz.c <version_info> src/moduleB
src/moduleB <version_info> src
src/moduleA/foo.c <version_info> src/moduleA
src/moduleA/bar.c <version_info> src/moduleA
src/moduleA <version_info> src
src <version_info> ""
Makefile <version_info> ""
When the parent directory changes, if it's a subdirectory of the previous
parent directory (e.g. "" -> src/moduleB) then we can just keep appending.
If the parent directory differs from the previous parent directory and is
not a subdirectory, then we should process that directory.
So, for example, when we get to this point:
tokens.txt <version_info> ""
src/moduleB/umm.c <version_info> src/moduleB
src/moduleB/baz.c <version_info> src/moduleB
and note that the next entry (src/moduleB) has a different parent than
the last one that isn't a subdirectory, we should write out a tree for it
100644 blob <HASH> umm.c
100644 blob <HASH> baz.c
then pop all the entries under that directory while recording the new
hash for that directory, leaving us with
tokens.txt <version_info> ""
src/moduleB <new version_info> src
This process repeats until at the end we get to
tokens.txt <version_info> ""
src <new version_info> ""
Makefile <version_info> ""
and then we can write out the toplevel tree. Since we potentially have
entries in our string_list corresponding to multiple different toplevel
directories, e.g. a slightly different repository might have:
whizbang.txt <version_info> ""
tokens.txt <version_info> ""
src/moduleD <new version_info> src
src/moduleC <new version_info> src
src/moduleB <new version_info> src
src/moduleA/foo.c <version_info> src/moduleA
src/moduleA/bar.c <version_info> src/moduleA
When src/moduleA is popped off, we need to know that the "last
directory" reverts back to src, and how many entries in our string_list
are associated with that parent directory. So I use an auxiliary
offsets string_list which would have (parent_directory,offset)
information of the form
"" 0
src 2
src/moduleA 5
Whenever I write out a tree for a subdirectory, I set versions.nr to
the final offset value and then decrement offsets.nr...and then add
an entry to versions with a hash for the new directory.
The idea is relatively simple, there's just a lot of accounting to
implement this.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Create a new function, write_tree(), which will take a list of
basenames, modes, and oids for a single directory and create a tree
object in the object-store. We do not yet have just basenames, modes,
and oids for just a single directory (we have a mixture of entries from
all directory levels in the hierarchy) so we still die() before the
current call to write_tree(), but the next patch will rectify that.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As a step towards transforming the processed path->conflict_info entries
into an actual tree object, start recording basenames, modes, and oids
in a dir_metadata structure. Subsequent commits will make use of this
to actually write a tree.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We want to handle paths below a directory before needing to handle the
directory itself. Also, we want to handle the directory immediately
after the paths below it, so we can't use simple lexicographic ordering
from strcmp (which would insert foo.txt between foo and foo/file.c).
Copy string_list_df_name_compare() from merge-recursive.c, and set up a
string list of paths sorted by that function so that we can iterate in
the desired order.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a process_entries() implementation that just loops over the paths
and processes each one individually with an auxiliary process_entry()
call. Add a basic process_entry() as well, which handles several cases
but leaves a few of the more involved ones with die-not-implemented
messages. Also, although process_entries() is supposed to create a
tree, it does not yet have code to do so -- except in the special case
of merging completely empty trees.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When all three trees have the same oid, there is no need to recurse into
these trees to find that all files within them happen to match. We can
just record any one of the trees as the resolution of merging that
particular path.
Immediately resolving trees for other types of trivial tree merges (such
as one side matches the merge base, or the two sides match each other)
would prevent us from detecting renames for some paths, and thus prevent
us from doing three-way content merges for those paths whose renames we
did not detect.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Create a helper function, setup_path_info(), which can be used to record
all the information we want in a merged_info or conflict_info. While
there is currently only one caller of this new function, and some of its
particular parameters are fixed, future callers of this function will be
added later.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Three-way merges, by their nature, are going to often have two or more
trees match at a given subdirectory. We can avoid calling
fill_tree_descriptor() on the same tree by checking when these trees
match. Noting when various oids match will also be useful in other
calculations and optimizations as well.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This does not actually collect any necessary info other than the
pathnames involved, since it just allocates an all-zero conflict_info
and stuffs that into paths. However, it invokes the traverse_trees()
machinery to walk over all the paths and sets up the basic
infrastructure we need.
I have left out a few obvious optimizations to try to make this patch as
short and obvious as possible. A subsequent patch will add some of
those back in with some more useful data fields before we introduce a
patch that actually sets up the conflict_info fields.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Various places in merge-recursive used an err() function when it hit
some kind of unrecoverable error. That code was from the reusable bits
of merge-recursive.c that we liked, such as merge_3way, writing object
files to the object store, reading blobs from the object store, etc. So
create a similar function to allow us to port that code over, and use it
for when we detect problems returned from collect_merge_info()'s
traverse_trees() call, which we will be adding next.
While we are at it, also add more documentation for the "clean" field
from struct merge_result, particularly since the name suggests a boolean
but it is not quite one and this is our first non-boolean usage.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In my cursory investigation, histogram diffs are about 2% slower than
Myers diffs. Others have probably done more detailed benchmarks. But,
in short, histogram diffs have been around for years and in a number of
cases provide obviously better looking diffs where Myers diffs are
unintelligible but the performance hit has kept them from becoming the
default.
However, there are real merge bugs we know about that have triggered on
git.git and linux.git, which I don't have a clue how to address without
the additional information that I believe is provided by histogram
diffs. See the following:
https://lore.kernel.org/git/20190816184051.GB13894@sigill.intra.peff.net/https://lore.kernel.org/git/CABPp-BHvJHpSJT7sdFwfNcPn_sOXwJi3=o14qjZS3M8Rzcxe2A@mail.gmail.com/https://lore.kernel.org/git/CABPp-BGtez4qjbtFT1hQoREfcJPmk9MzjhY5eEq1QhXT23tFOw@mail.gmail.com/
I don't like mismerges. I really don't like silent mismerges. While I
am sometimes willing to make performance and correctness tradeoff, I'm
much more interested in correctness in general. I want to fix the above
bugs. I have not yet started doing so, but I believe histogram diff at
least gives me an angle. Unfortunately, I can't rely on using the
information from histogram diff unless it's in use. And it hasn't been
used because of a few percentage performance hit.
In testcases I have looked at, merge-ort is _much_ faster than
merge-recursive for non-trivial merges/rebases/cherry-picks. As such,
this is a golden opportunity to switch out the underlying diff algorithm
(at least the one used by the merge machinery; git-diff and git-log are
separate questions); doing so will allow me to get additional data and
improved diffs, and I believe it will help me fix the above bugs at some
point in the future.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
merge_start() basically does a bunch of sanity checks, then allocates
and initializes opt->priv -- a struct merge_options_internal.
Most of the sanity checks are usable as-is. The
allocation/intialization is a bit different since merge-ort has a very
different merge_options_internal than merge-recursive, but the idea is
the same.
The weirdest part here is that merge-ort and merge-recursive use the
same struct merge_options, even though merge_options has a number of
fields that are oddly specific to merge-recursive's internal
implementation and don't even make sense with merge-ort's high-level
design (e.g. buffer_output, which merge-ort has to always do). I reused
the same data structure because:
* most the fields made sense to both merge algorithms
* making a new struct would have required making new enums or somehow
externalizing them, and that was getting messy.
* it simplifies converting the existing callers by not having to
have different code paths for merge_options setup.
I also marked detect_renames as ignored. We can revisit that later, but
in short: merge-recursive allowed turning off rename detection because
it was sometimes glacially slow. When you speed something up by a few
orders of magnitude, it's worth revisiting whether that justification is
still relevant. Besides, if folks find it's still too slow, perhaps
they have a better scaling case than I could find and maybe it turns up
some more optimizations we can add. If it still is needed as an option,
it is easy to add later.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
merge_ort_nonrecursive_internal() will be used by both
merge_inmemory_nonrecursive() and merge_inmemory_recursive(); let's
focus on it for now. It involves some setup -- merge_start() --
followed by the following chain of functions:
collect_merge_info()
This function will populate merge_options_internal's paths field,
via a call to traverse_trees() and a new callback that will be added
later.
detect_and_process_renames()
This function will detect renames, and then adjust entries in paths
to move conflict stages from old pathnames into those for new
pathnames, so that the next step doesn't have to think about renames
and just can do three-way content merging and such.
process_entries()
This function determines how to take the various stages (versions of
a file from the three different sides) and merge them, and whether
to mark the result as conflicted or cleanly merged. It also writes
out these merged file versions as it goes to create a tree.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Set up some basic internal data structures. The only carry-over from
merge-recursive.c is call_depth, though needed_rename_limit will be
added later.
The central piece of data will definitely be the strmap "paths", which
will map every relevant pathname under consideration to either a
merged_info or a conflict_info. ("conflicted" is a strmap that is a
subset of "paths".)
merged_info contains all relevant information for a non-conflicted
entry. conflict_info contains a merged_info, plus any additional
information about a conflict such as the higher orders stages involved
and the names of the paths those came from (handy once renames get
involved). If an entry remains conflicted, the merged_info portion of a
conflict_info will later be filled with whatever version of the file
should be placed in the working directory (e.g. an as-merged-as-possible
variation that contains conflict markers).
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is the beginning of a new merge strategy. While there are some API
differences, and the implementation has some differences in behavior, it
is essentially meant as an eventual drop-in replacement for
merge-recursive.c. However, it is being built to exist side-by-side
with merge-recursive so that we have plenty of time to find out how
those differences pan out in the real world while people can still fall
back to merge-recursive. (Also, I intend to avoid modifying
merge-recursive during this process, to keep it stable.)
The primary difference noticable here is that the updating of the
working tree and index is not done simultaneously with the merge
algorithm, but is a separate post-processing step. The new API is
designed so that one can do repeated merges (e.g. during a rebase or
cherry-pick) and only update the index and working tree one time at the
end instead of updating it with every intermediate result. Also, one
can perform a merge between two branches, neither of which match the
index or the working tree, without clobbering the index or working tree.
The next three commits will demonstrate various uses of this new API.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>