git-commit-vandalism

Author	SHA1	Message	Date
Elijah Newren	2fd9eda462	merge-ort: precompute whether directory rename detection is needed The point of directory rename detection is that if one side of history renames a directory, and the other side adds new files under the old directory, then the merge can move those new files into the new directory. This leads to the following important observation: * If the other side does not add any new files under the old directory, we do not need to detect any renames for that directory. Similarly, directory rename detection had an important requirement: * If a directory still exists on one side of history, it has not been renamed on that side of history. (See section 4 of t6423 or Documentation/technical/directory-rename-detection.txt for more details). Using these two bits of information, we note that directory rename detection is only needed in cases where (1) directories exist in the merge base and on one side of history (i.e. dirmask == 3 or dirmask == 5), and (2) where there is some new file added to that directory on the side where it still exists (thus where the file has filemask == 2 or filemask == 4, respectively). This has to be done in two steps, because we have the dirmask when we are first considering the directory, and won't get the filemasks for the files within it until we recurse into that directory. So, we save dir_rename_mask = dirmask - 1 when we hit a directory that is missing on one side, and then later look for cases of filemask == dir_rename_mask One final note is that as soon as we hit a directory that needs directory rename detection, we will need to detect renames in all subdirectories of that directory as well due to the "majority rules" decision when files are renamed into different directory hierarchies. We arbitrarily use the special value of 0x07 to record when we've hit such a directory. The combination of all the above mean that we introduce a variable named dir_rename_mask (couldn't think of a better name) which has one of the following values as we traverse into a directory: * 0x00: directory rename detection not needed * 0x02 or 0x04: directory rename detection only needed if files added * 0x07: directory rename detection definitely needed We then pass this value through to add_pairs() so that it can mark location_relevant as true only when dir_rename_mask is 0x07. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-10 22:18:05 -08:00
Elijah Newren	a68e6cea59	merge-ort: introduce wrappers for alternate tree traversal Add traverse_trees_wrapper() and traverse_trees_wrapper_callback() functions. The former runs traverse_trees() with info->fn set to traverse_trees_wrapper_callback, in order to simply save all the entries without processing or recursing into any of them. This step allows extra computation to be done (e.g. checking some condition across all files) that can be used later. Then, after that is completed, it iterates over all the saved entries and calls the original info->fn callback with the saved data. Currently, this does nothing more than marginally slowing down the tree traversal since we do not take advantage of the opportunity to compute anything special in traverse_trees_wrapper_callback(), and thus the real callback will be called identically as it would have been without this extra wrapper. However, a subsequent commit will add some special computation of some values that the real callback will be able to use. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-10 22:18:05 -08:00
Elijah Newren	beb06145f8	merge-ort: add data structures for an alternate tree traversal In order to determine whether directory rename detection is needed, we as a pre-requisite need a way to traverse through all the files in a given tree before visiting any directories within that tree. traverse_trees() only iterates through the entries in the order they appear, so add some data structures that will store all the entries as we iterate through them in traverse_trees(), which will allow us to re-traverse them in our desired order. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-10 22:18:04 -08:00
Elijah Newren	32a56dfb99	merge-ort: precompute subset of sources for which we need rename detection rename detection works by trying to pair all file deletions (or "sources") with all file additions (or "destinations"), checking similarity, and then marking the sufficiently similar ones as renames. This can be expensive if there are many sources and destinations on a given side of history as it results in an N x M comparison matrix. However, there are many cases where we can compute in advance that detecting renames for some of the sources provides no useful information and thus that we can exclude those sources from the matrix. To see why, first note that the merge machinery uses detected renames in two ways: * directory rename detection: when one side of history renames a directory, and the other side of history adds new files to that directory, we want to be able to warn the user about the need to chose whether those new files stay in the old directory or move to the new one. * three-way content merging: in order to do three-way content merging of files, we need three different file versions. If one side of history renamed a file, then some of the content for the file is found under a different path than in the merge base or on the other side of history. Add a simple testcase showing the two kinds of reasons renames are relevant; it's a testcase that will only pass if we detect both kinds of needed renames. Other than the testcase added above, this commit concentrates just on the three-way content merging; it will punt and mark all sources as needed for directory rename detection, and leave it to future commits to narrow that down more. The point of three-way content merging is to reconcile changes made on both sides of history. What if the file wasn't modified on both sides? There are two possibilities: * If it wasn't modified on the renamed side: -> then we get to do exact rename detection, which is cheap. * If it wasn't modified on the unrenamed side: -> then detection of a rename for that source file is irrelevant That latter claim might be surprising at first, so let's walk through a case to show why rename detection for that source file is irrelevant. Let's use two filenames, old.c & new.c, with the following abbreviated object ids (and where the value '000000' is used to denote that the file is missing in that commit): old.c new.c MERGE_BASE: 01d01d 000000 MERGE_SIDE1: 01d01d 000000 MERGE_SIDE2: 000000 5e1ec7 If the rename isn't detected: then old.c looks like it was unmodified on one side and deleted on the other and should thus be removed. new.c looks like a new file we should keep as-is. If the rename is detected: then a three-way content merge is done. Since the version of the file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge will produce exactly the version of the file whose abbreviated object id is 5e1ec7. It will record that file at the path new.c, while removing old.c from the directory. Note that these two results are identical -- a single file named 'new.c' with object id 5e1ec7. In other words, it doesn't matter if the rename is detected in the case where the file is unmodified on the unrenamed side. Use this information to compute whether we need rename detection for each source created in add_pair(). It's probably worth noting that there used to be a few other edge or corner cases besides three-way content merges and directory rename detection where lack of rename detection could have affected the result, but those cases actually highlighted where conflict resolution methods were not consistent with each other. Fixing those inconsistencies were thus critically important to enabling this optimization. That work involved the following: * bringing consistency to add/add, rename/add, and rename/rename conflict types, as done back in the topic merged at commit `ac193e0e0a` ("Merge branch 'en/merge-path-collision'", 2019-01-04), and further extended in commits `2a7c16c980` ("t6422, t6426: be more flexible for add/add conflicts involving renames", 2020-08-10) and `e8eb99d4a6` ("t642[23]: be more flexible for add/add conflicts involving pair renames", 2020-08-10) * making rename/delete more consistent with modify/delete as done in commits `1f3c9ba707` ("t6425: be more flexible with rename/delete conflict messages", 2020-08-10) and `727c75b23f` ("t6404, t6423: expect improved rename/delete handling in ort backend", 2020-10-26) Since the set of relevant_sources we compute has not yet been narrowed down for directory rename detection, we do not pass it to diffcore_rename_extended() yet. That will be done after subsequent commits narrow down the list of relevant_sources needed for directory rename detection reasons. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-10 22:18:04 -08:00
Elijah Newren	9799889f2e	diffcore-rename: enable filtering possible rename sources Add the ability to diffcore_rename_extended() to allow external callers to declare that they only need renames detected for a subset of source files, and use that information to skip detecting renames for them. There are two important pieces to this optimization that may not be obvious at first glance: * We do not require callers to just filter the filepairs out to remove the non-relevant sources, because exact rename detection is fast and when it finds a match it can remove both a source and a destination whereas the relevant_sources filter can only remove a source. * We need to filter out the source pairs in a preliminary pass instead of adding a strset_contains(relevant_sources, one->path) check within the nested matrix loop. The reason for that is if we have 30k renames, doing 30k * 30k = 900M strset_contains() calls becomes extraordinarily expensive and defeats the performance gains from this change; we only want to do 30k such calls instead. If callers pass NULL for relevant_sources, that is special cases to treat all sources as relevant. Since all callers currently pass NULL, this optimization does not yet have any effect. Subsequent commits will have merge-ort compute a set of relevant_sources to restrict which sources we detect renames for, and have merge-ort pass that set of relevant_sources to diffcore_rename_extended(). A note about filtering order: Some may be curious why we don't filter out irrelevant sources at the same time we filter out exact renames. While that technically could be done at this point, there are two reasons to defer it: First, was to reinforce a lesson that was too easy to forget. As I mentioned above, in the past I filtered irrelevant sources out before exact rename checking, and then discovered that exact renames' ability to remove both sources and destinations was an important consideration and thus doing the filtering after exact rename checking would speed things up. Then at some point I realized that basename matching could also remove both sources and destinations, and decided to put irrelevant source filtering after basename filtering. That slowed things down a lot. But, despite learning about this important ordering, in later restructuring I forgot and made the same mistake of putting the filtering after basename guided rename detection again. So, I have this series of patches structured to do the irrelevant filtering last to start to show how much extra it costs, and then add relevant filtering in to find_basename_matches() to show how much it speeds things up. Basically, it's a way to reinforce something that apparently was too easy to forget, and make sure the commit messages record this lesson. Second, the items in the "relevant_sources" in this patch series will include all sources that might be relevant. It has to be conservative and catch anything that might need a rename, but in the patch series after this one we'll find ways to weed out more of the might be relevant ones. Unfortunately, merge-ort does not have sufficient information to weed those ones out, and there isn't enough information at the time of filtering exact renames out to remove the extra ones either. It has to be deferred. So the deferral is in part to simplify some later additions. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-10 22:18:04 -08:00
Elijah Newren	81afdf7a2e	diffcore-rename: compute dir_rename_guess from dir_rename_counts dir_rename_counts has a mapping of a mapping, in particular, it has old_dir => { new_dir => count } We want a simple mapping of old_dir => new_dir based on which new_dir had the highest count for a given old_dir. Compute this and store it in dir_rename_guess. This is the final piece of the puzzle needed to make our guesses at which directory files have been moved to when basenames aren't unique. For the testcases mentioned in commit `557ac0350d` ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 12.775 s ± 0.062 s 12.596 s ± 0.061 s mega-renames: 188.754 s ± 0.284 s 130.465 s ± 0.259 s just-one-mega: 5.599 s ± 0.019 s 3.958 s ± 0.010 s Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:12 -08:00
Elijah Newren	333899e1e3	diffcore-rename: limit dir_rename_counts computation to relevant dirs We are using dir_rename_counts to count the number of other directories that files within a directory moved to. We only need this information for directories that disappeared, though, so we can return early from update_dir_rename_counts() for other paths. If dirs_removed is passed to diffcore_rename_extended(), then it provides the relevant bits of information for us to limit this counting to relevant dirs. If dirs_removed is not passed, we would need to compute some replacement in order to do this limiting. Introduce a new info->relevant_source_dirs variable for this purpose, even though at this stage we will only set it to dirs_removed for simplicity. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:12 -08:00
Elijah Newren	1ad69eb0dc	diffcore-rename: compute dir_rename_counts in stages Compute dir_rename_counts based just on exact renames to start, as that can provide us useful information in find_basename_matches(). This is done by moving the code from compute_dir_rename_counts() into initialize_dir_rename_info(), resulting in it being computed earlier and based just on exact renames. Since that's an incomplete result, we augment the counts via calling update_dir_rename_counts() after each basename-guide and inexact rename detection match is found. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:12 -08:00
Elijah Newren	b1473019e8	diffcore-rename: extend cleanup_dir_rename_info() When diffcore_rename_extended() is passed a NULL dir_rename_count, we will still want to create a temporary one for use by find_basename_matches(), but have it fully deallocated before diffcore_rename_extended() returns. However, when diffcore_rename_extended() is passed a dir_rename_count, we want to fill that strmap with appropriate values and return it. However, for our interim purposes we may also add entries corresponding to directories that cannot have been renamed due to still existing on both sides. Extend cleanup_dir_rename_info() to handle these two different cases, cleaning up the relevant bits of information for each case. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:12 -08:00
Elijah Newren	b6e3d27434	diffcore-rename: move dir_rename_counts into dir_rename_info struct This continues the migration of the directory rename detection code into diffcore-rename, now taking the simple step of combining it with the dir_rename_info struct. Future commits will then make dir_rename_counts be computed in stages, and add computation of dir_rename_guess. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:11 -08:00
Elijah Newren	cd52e0050f	diffcore-rename: add function for clearing dir_rename_count As we adjust the usage of dir_rename_count we want to have a function for clearing, or partially clearing it out. Add a partial_clear_dir_rename_count() function for this purpose. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:11 -08:00
Elijah Newren	0c4fd732f0	Move computation of dir_rename_count from merge-ort to diffcore-rename Move the computation of dir_rename_count from merge-ort.c to diffcore-rename.c, making slight adjustments to the data structures based on the move. While the diffstat looks large, viewing this commit with --color-moved makes it clear that only about 20 lines changed. With this patch, the computation of dir_rename_count is still only done after inexact rename detection, but subsequent commits will add a preliminary computation of dir_rename_count after exact rename detection, followed by some updates after inexact rename detection. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:11 -08:00
Elijah Newren	ae8cf74d3f	diffcore-rename: add a mapping of destination names to their indices Compute a mapping of full filename to the index within rename_dst where that filename is found, and store it in idx_map. idx_possible_rename() needs this to quickly finding an array entry in rename_dst given the pathname. While at it, add placeholder initializations for dir_rename_count and dir_rename_guess; these will be more fully populated in subsequent commits. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:11 -08:00
Elijah Newren	bde8b9f34c	diffcore-rename: provide basic implementation of idx_possible_rename() Add a new struct dir_rename_info with various values we need inside our idx_possible_rename() function introduced in the previous commit. Add a basic implementation for this function showing how we plan to use the variables, but which will just return early with a value of -1 (not found) when those variables are not set up. Future commits will do the work necessary to set up those other variables so that idx_possible_rename() does not always return -1. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:11 -08:00
Elijah Newren	37a2514364	diffcore-rename: use directory rename guided basename comparisons A previous commit noted that it is very common for people to move files across directories while keeping their filename the same. The last few commits took advantage of this and showed that we can accelerate rename detection significantly using basenames; since files with the same basename serve as likely rename candidates, we can check those first and remove them from the rename candidate pool if they are sufficiently similar. Unfortunately, the previous optimization was limited by the fact that the remaining basenames after exact rename detection are not always unique. Many repositories have hundreds of build files with the same name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have hundreds of source files with the same name. (For example, the linux kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at $DAYJOB has a lot of ObjectFactory.java and Plugin.java files). For these files with non-unique basenames, we are faced with the task of attempting to determine or guess which directory they may have been relocated to. Such a task is precisely the job of directory rename detection. However, there are two catches: (1) the directory rename detection code has traditionally been part of the merge machinery rather than diffcore-rename.c, and (2) directory rename detection currently runs after regular rename detection is complete. The 1st catch is just an implementation issue that can be overcome by some code shuffling. The 2nd requires us to add a further approximation: we only have access to exact renames at this point, so we need to do directory rename detection based on just exact renames. In some cases we won't have exact renames, in which case this extra optimization won't apply. We also choose to not apply the optimization unless we know that the underlying directory was removed, which will require extra data to be passed in to diffcore_rename_extended(). Also, even if we get a prediction about which directory a file may have relocated to, we will still need to check to see if there is a file in the predicted directory, and then compare the two files to see if they meet the higher min_basename_score threshold required for marking the two files as renames. This commit introduces an idx_possible_rename() function which will do this directory rename detection for us and give us the index within rename_dst of the resulting filename. For now, this function is hardcoded to return -1 (not found) and just hooks up how its results would be used once we have a more complete implementation in place. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-26 17:53:11 -08:00
Elijah Newren	f78cf97617	merge-ort: call diffcore_rename() directly We want to pass additional information to diffcore_rename() (or some variant thereof) without plumbing that extra information through diff_tree_oid() and diffcore_std(). Further, since we will need to gather additional special information related to diffs and are walking the trees anyway in collect_merge_info(), it seems odd to have diff_tree_oid()/diffcore_std() repeat those tree walks. And there may be times where we can avoid traversing into a subtree in collect_merge_info() (based on additional information at our disposal), that the basic diff logic would be unable to take advantage of. For all these reasons, just create the add and delete pairs ourself and then call diffcore_rename() directly. This change is primarily about enabling future optimizations; the advantage of avoiding extra tree traversals is small compared to the cost of rename detection, and the advantage of avoiding the extra tree traversals is somewhat offset by the extra time spent in collect_merge_info() collecting the additional data anyway. However... For the testcases mentioned in commit `557ac0350d` ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.294 s ± 0.103 s 12.775 s ± 0.062 s mega-renames: 187.248 s ± 0.882 s 188.754 s ± 0.284 s just-one-mega: 5.557 s ± 0.017 s 5.599 s ± 0.019 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	07c9a7fcb5	gitdiffcore doc: mention new preliminary step for rename detection The last few patches have introduced a new preliminary step when rename detection is on but both break detection and copy detection are off. Document this new step. While we're at it, add a testcase that checks the new behavior as well. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	bd24aa2f97	diffcore-rename: guide inexact rename detection based on basenames Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up files based on basenames. As a quick reminder (see the last two commit messages for more details), this means for example that docs/extensions.txt and docs/config/extensions.txt are considered likely renames if there are no remaining 'extensions.txt' files elsewhere among the added and deleted files, and if a similarity check confirms they are similar, then they are marked as a rename without looking for a better similarity match among other files. This is a behavioral change, as covered in more detail in the previous commit message. We do not use this heuristic together with either break or copy detection. The point of break detection is to say that filename similarity does not imply file content similarity, and we only want to know about file content similarity. The point of copy detection is to use more resources to check for additional similarities, while this is an optimization that uses far less resources but which might also result in finding slightly fewer similarities. So the idea behind this optimization goes against both of those features, and will be turned off for both. For the testcases mentioned in commit `557ac0350d` ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	da09f65127	diffcore-rename: complete find_basename_matches() It is not uncommon in real world repositories for the majority of file renames to not change the basename of the file; i.e. most "renames" are just a move of files into different directories. We can make use of this to avoid comparing all rename source candidates with all rename destination candidates, by first comparing sources to destinations with the same basenames. If two files with the same basename are sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison. This means we are adding a set of preliminary additional comparisons, but for each file we only compare it with at most one other file. For example, if there was a include/media/device.h that was deleted and a src/module/media/device.h that was added, and there are no other device.h files in the remaining sets of added and deleted files after exact rename detection, then these two files would be compared in the preliminary step. This commit does not yet actually employ this new optimization, it merely adds a function which can be used for this purpose. The next commit will do the necessary plumbing to make use of it. Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename. I think that is okay for four reasons: (1) it's easy to explain to the users what happened if it does ever occur (or even for them to intuitively figure out), (2) as the next patch will show it provides such a large performance boost that it's worth the tradeoff, and (3) it's somewhat unlikely that despite having unique matching basenames that other files serve as better matches. Reason (4) takes a full paragraph to explain... If the previous three reasons aren't enough, consider what rename detection already does. Break detection is not the default, meaning that if files have the same _fullname_, then they are considered related even if they are 0% similar. In fact, in such a case, we don't even bother comparing the files to see if they are similar let alone comparing them to all other files to see what they are most similar to. Basically, we override content similarity based on sufficient filename similarity. Without the filename similarity (currently implemented as an exact match of filename), we swing the pendulum the opposite direction and say that filename similarity is irrelevant and compare a full N x M matrix of sources and destinations to find out which have the most similar contents. This optimization just adds another form of filename similarity comparison, but augments it with a file content similarity check as well. Basically, if two files have the same basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	a35df3371c	diffcore-rename: compute basenames of source and dest candidates We want to make use of unique basenames among remaining source and destination files to help inform rename detection, so that more likely pairings can be checked first. (src/moduleA/foo.txt and source/module/A/foo.txt are likely related if there are no other 'foo.txt' files among the remaining deleted and added files.) Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up. Non-unique basenames still show up in the map, but have an invalid index (-1). This function was inspired by the fact that in real world repositories, files are often moved across directories without changing names. Here are some sample repositories and the percentage of their historical renames (as of early 2020) that preserved basenames: * linux: 76% * gcc: 64% * gecko: 79% * webkit: 89% These statistics alone don't prove that an optimization in this area will help or how much it will help, since there are also unpaired adds and deletes, restrictions on which basenames we consider, etc., but it certainly motivated the idea to try something in this area. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	f3845257a5	t4001: add a test comparing basename similarity and content similarity Add a simple test where a removed file is similar to two different added files; one of them has the same basename, and the other has a slightly higher content similarity. In the current test, content similarity is weighted higher than filename similarity. Subsequent commits will add a new rule that weighs a mixture of filename similarity and content similarity in a manner that will change the outcome of this testcase. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	829514c515	diffcore-rename: filter rename_src list when possible We have to look at each entry in rename_src a total of rename_dst_nr times. When we're not detecting copies, any exact renames or ignorable rename paths will just be skipped over. While checking that these can be skipped over is a relatively cheap check, it's still a waste of time to do that check more than once, let alone rename_dst_nr times. When rename_src_nr is a few thousand times bigger than the number of relevant sources (such as when cherry-picking a commit that only touched a handful of files, but from a side of history that has different names for some high level directories), this time can add up. First make an initial pass over the rename_src array and move all the relevant entries to the front, so that we can iterate over just those relevant entries. For the testcases mentioned in commit `557ac0350d` ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.119 s ± 0.101 s 13.815 s ± 0.062 s mega-renames: 1802.044 s ± 0.828 s 1799.937 s ± 0.493 s just-one-mega: 51.391 s ± 0.028 s 51.289 s ± 0.019 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-15 18:02:16 -08:00
Elijah Newren	f15eb7c1cf	diffcore-rename: no point trying to find a match better than exact diffcore_rename() had some code to avoid having destination paths that already had an exact rename detected from being re-checked for other renames. Source paths, however, were re-checked because we wanted to allow the possibility of detecting copies. But if copy detection isn't turned on, then this merely amounts to attempting to find a better-than-exact match, which naturally ends up being an expensive no-op. In particular, copy detection is never turned on by the merge machinery. For the testcases mentioned in commit `557ac0350d` ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 14.263 s ± 0.053 s 14.119 s ± 0.101 s mega-renames: 5504.231 s ± 5.150 s 1802.044 s ± 0.828 s just-one-mega: 158.534 s ± 0.498 s 51.391 s ± 0.028 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-12 12:04:00 -08:00
Junio C Hamano	f011795891	Sync with maint	2021-02-11 13:58:52 -08:00
Junio C Hamano	d3a035b055	Merge branch 'en/merge-ort-perf' The "ort" merge strategy. * en/merge-ort-perf: merge-ort: begin performance work; instrument with trace2_region_* calls merge-ort: ignore the directory rename split conflict for now merge-ort: fix massive leak	2021-02-11 13:58:44 -08:00
Junio C Hamano	a21e27ef6b	Merge branch 'en/ort-directory-rename' ORT merge strategy learns to infer "renamed directory" while merging. * en/ort-directory-rename: merge-ort: fix a directory rename detection bug merge-ort: process_renames() now needs more defensiveness merge-ort: implement apply_directory_rename_modifications() merge-ort: add a new toplevel_dir field merge-ort: implement handle_path_level_conflicts() merge-ort: implement check_for_directory_rename() merge-ort: implement apply_dir_rename() and check_dir_renamed() merge-ort: implement compute_collisions() merge-ort: modify collect_renames() for directory rename handling merge-ort: implement handle_directory_level_conflicts() merge-ort: implement compute_rename_counts() merge-ort: copy get_renamed_dir_portion() from merge-recursive.c merge-ort: add outline of get_provisional_directory_renames() merge-ort: add outline for computing directory renames merge-ort: collect which directories are removed in dirs_removed merge-ort: initialize and free new directory rename data structures merge-ort: add new data structures for directory rename detection	2021-02-11 13:58:43 -08:00
Junio C Hamano	59ec22464f	Merge branch 'tb/ci-run-cocci-with-18.04' into maint * tb/ci-run-cocci-with-18.04: .github/workflows/main.yml: run static-analysis on bionic	2021-02-11 13:57:36 -08:00
Junio C Hamano	c6102b7585	Merge branch 'tb/ci-run-cocci-with-18.04' The version of Ubuntu Linux used by default at GitHub Actions CI has been updated to one that lack coccinelle; until it gets fixed, work it around by sticking to the previous release (18.04). * tb/ci-run-cocci-with-18.04: .github/workflows/main.yml: run static-analysis on bionic	2021-02-10 16:48:07 -08:00
Junio C Hamano	f9f2520108	The seventh batch Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-10 14:48:33 -08:00
Junio C Hamano	466f94ec45	Merge branch 'ab/detox-gettext-tests' Get rid of "GETTEXT_POISON" support altogether, which may or may not be controversial. * ab/detox-gettext-tests: tests: remove uses of GIT_TEST_GETTEXT_POISON=false tests: remove support for GIT_TEST_GETTEXT_POISON ci: remove GETTEXT_POISON jobs	2021-02-10 14:48:33 -08:00
Junio C Hamano	59ace284f3	Merge branch 'ab/grep-pcre-invalid-utf8' Update support for invalid UTF-8 in PCRE2. * ab/grep-pcre-invalid-utf8: grep/pcre2: better support invalid UTF-8 haystacks grep/pcre2 tests: don't rely on invalid UTF-8 data test	2021-02-10 14:48:33 -08:00
Junio C Hamano	0199c68d01	Merge branch 'ab/retire-pcre1' The support for deprecated PCRE1 library has been dropped. * ab/retire-pcre1: Remove support for v1 of the PCRE library config.mak.uname: remove redundant NO_LIBPCRE1_JIT flag	2021-02-10 14:48:33 -08:00
Junio C Hamano	938ecaa42f	Merge branch 'jk/pretty-lazy-load-commit' Some pretty-format specifiers do not need the data in commit object (e.g. "%H"), but we were over-eager to load and parse it, which has been made even lazier. * jk/pretty-lazy-load-commit: pretty: lazy-load commit data when expanding user-format	2021-02-10 14:48:33 -08:00
Junio C Hamano	2f794620f5	Merge branch 'ds/more-index-cleanups' Cleaning various codepaths up. * ds/more-index-cleanups: t1092: test interesting sparse-checkout scenarios test-lib: test_region looks for trace2 regions sparse-checkout: load sparse-checkout patterns name-hash: use trace2 regions for init repository: add repo reference to index_state fsmonitor: de-duplicate BUG()s around dirty bits cache-tree: extract subtree_pos() cache-tree: simplify verify_cache() prototype cache-tree: clean up cache_tree_update()	2021-02-10 14:48:33 -08:00
Junio C Hamano	02fb21617e	Merge branch 'rs/worktree-list-verbose' `git worktree list` now annotates worktrees as prunable, shows locked and prunable attributes in --porcelain mode, and gained a --verbose option. * rs/worktree-list-verbose: worktree: teach `list` verbose mode worktree: teach `list` to annotate prunable worktree worktree: teach `list --porcelain` to annotate locked worktree t2402: ensure locked worktree is properly cleaned up worktree: teach worktree_lock_reason() to gently handle main worktree worktree: teach worktree to lazy-load "prunable" reason worktree: libify should_prune_worktree()	2021-02-10 14:48:32 -08:00
Junio C Hamano	7e94720c1e	Merge branch 'js/rebase-i-commit-cleanup-fix' When "git rebase -i" processes "fixup" insn, there is no reason to clean up the commit log message, but we did the usual stripspace processing. This has been corrected. * js/rebase-i-commit-cleanup-fix: rebase -i: do leave commit message intact in fixup! chains	2021-02-10 14:48:32 -08:00
Junio C Hamano	e5abed92f5	Merge branch 'jk/t0000-cleanups' Code clean-up. * jk/t0000-cleanups: t0000: consistently use single quotes for outer tests t0000: run cleaning test inside sub-test t0000: run prereq tests inside sub-test t0000: keep clean-up tests together	2021-02-10 14:48:32 -08:00
Junio C Hamano	04703f64be	Merge branch 'sg/t7800-difftool-robustify' Test fix. * sg/t7800-difftool-robustify: t7800-difftool: don't accidentally match tmp dirs	2021-02-10 14:48:32 -08:00
Junio C Hamano	c9f94ab4fa	Merge branch 'ab/lose-grep-debug' Lose the debugging aid that may have been useful in the past, but no longer is, in the "grep" codepaths. * ab/lose-grep-debug: grep/log: remove hidden --debug and --grep-debug options	2021-02-10 14:48:31 -08:00
Junio C Hamano	9d5b1c06ac	Merge branch 'jk/use-oid-pos' Code clean-up to ensure our use of hashtables using object names as keys use the "struct object_id" objects, not the raw hash values. * jk/use-oid-pos: oid_pos(): access table through const pointers hash_pos(): convert to oid_pos() rerere: use strmap to store rerere directories rerere: tighten rr-cache dirname check rerere: check dirname format while iterating rr_cache directory commit_graft_pos(): take an oid instead of a bare hash	2021-02-10 14:48:31 -08:00
Junio C Hamano	1d4f2316c5	Sync with 2.30.1	2021-02-08 14:44:42 -08:00
Taylor Blau	d051ed77ee	.github/workflows/main.yml: run static-analysis on bionic GitHub Actions is transitioning workflow steps that run on 'ubuntu-latest' from 18.04 to 20.04 [1]. This works fine in all steps except the static-analysis one, since Coccinelle isn't available on Ubuntu focal (it is only available in the universe suite). Until Coccinelle can be installed from 20.04's main suite, pin the static-analysis build to run on 18.04, where it can be installed by default. [1]: https://github.com/actions/virtual-environments/issues/1816 Reported-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-08 14:38:07 -08:00
Junio C Hamano	773e25afc4	Git 2.30.1 Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-08 14:05:55 -08:00
Junio C Hamano	dadf9e519d	Merge branch 'pb/ci-matrix-wo-shortcut' into maint Our setting of GitHub CI test jobs were a bit too eager to give up once there is even one failure found. Tweak the knob to allow other jobs keep running even when we see a failure, so that we can find more failures in a single run. * pb/ci-matrix-wo-shortcut: ci: do not cancel all jobs of a matrix if one fails	2021-02-08 14:05:55 -08:00
Junio C Hamano	f20aeed235	Merge branch 'pb/blame-funcname-range-userdiff' into maint Test fix. * pb/blame-funcname-range-userdiff: annotate-tests: quote variable expansions containing path names	2021-02-08 14:05:55 -08:00
Junio C Hamano	6a7bf0ddb2	Merge branch 'jk/p5303-sed-portability-fix' into maint A perf script was made more portable. * jk/p5303-sed-portability-fix: p5303: avoid sed GNU-ism	2021-02-08 14:05:55 -08:00
Junio C Hamano	f2d156dc48	Merge branch 'ab/branch-sort' into maint The implementation of "git branch --sort" wrt the detached HEAD display has always been hacky, which has been cleaned up. * ab/branch-sort: branch: show "HEAD detached" first under reverse sort branch: sort detached HEAD based on a flag ref-filter: move ref_sorting flags to a bitfield ref-filter: move "cmp_fn" assignment into "else if" arm ref-filter: add braces to if/else if/else chain branch tests: add to --sort tests branch: change "--local" to "--list" in comment	2021-02-08 14:05:55 -08:00
Junio C Hamano	171675a6c5	Merge branch 'ma/more-opaque-lock-file' into maint Code clean-up. * ma/more-opaque-lock-file: read-cache: try not to peek into `struct {lock_,temp}file` refs/files-backend: don't peek into `struct lock_file` midx: don't peek into `struct lock_file` commit-graph: don't peek into `struct lock_file` builtin/gc: don't peek into `struct lock_file`	2021-02-08 14:05:55 -08:00
Junio C Hamano	6a20b9b9ef	Merge branch 'dl/p4-encode-after-kw-expansion' into maint Text encoding fix for "git p4". * dl/p4-encode-after-kw-expansion: git-p4: fix syncing file types with pattern	2021-02-08 14:05:54 -08:00
Junio C Hamano	f0e3c7f831	Merge branch 'ar/t6016-modernise' into maint Test update. * ar/t6016-modernise: t6016: move to lib-log-graph.sh framework	2021-02-08 14:05:54 -08:00

1 2 3 4 5 ...

62042 Commits