2005-05-21 11:39:09 +02:00
|
|
|
/*
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
*
|
2005-05-21 11:39:09 +02:00
|
|
|
* Copyright (C) 2005 Junio C Hamano
|
|
|
|
*/
|
|
|
|
#include "cache.h"
|
|
|
|
#include "diff.h"
|
|
|
|
#include "diffcore.h"
|
2018-05-16 01:42:15 +02:00
|
|
|
#include "object-store.h"
|
2013-11-14 20:20:26 +01:00
|
|
|
#include "hashmap.h"
|
2011-02-20 10:51:16 +01:00
|
|
|
#include "progress.h"
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
#include "promisor-remote.h"
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
#include "strmap.h"
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-24 10:10:48 +02:00
|
|
|
/* Table of rename/copy destinations */
|
|
|
|
|
|
|
|
static struct diff_rename_dst {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filepair *p;
|
|
|
|
struct diff_filespec *filespec_to_free;
|
|
|
|
int is_rename; /* false -> just a create; true -> rename or copy */
|
2005-05-24 10:10:48 +02:00
|
|
|
} *rename_dst;
|
|
|
|
static int rename_dst_nr, rename_dst_alloc;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* Mapping from break source pathname to break destination index */
|
|
|
|
static struct strintmap *break_idx = NULL;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
|
2015-02-27 02:39:48 +01:00
|
|
|
{
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* Lookup by p->ONE->path */
|
|
|
|
int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;
|
|
|
|
return (idx == -1) ? NULL : &rename_dst[idx];
|
2015-02-27 02:39:48 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns 0 on success, -1 if we found a duplicate.
|
|
|
|
*/
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
static int add_rename_dst(struct diff_filepair *p)
|
2015-02-27 02:39:48 +01:00
|
|
|
{
|
2014-03-03 23:31:54 +01:00
|
|
|
ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
rename_dst[rename_dst_nr].p = p;
|
|
|
|
rename_dst[rename_dst_nr].filespec_to_free = NULL;
|
|
|
|
rename_dst[rename_dst_nr].is_rename = 0;
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_dst_nr++;
|
2015-02-27 02:39:48 +01:00
|
|
|
return 0;
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
|
2005-05-28 00:55:55 +02:00
|
|
|
/* Table of rename/copy src files */
|
2005-05-24 10:10:48 +02:00
|
|
|
static struct diff_rename_src {
|
2011-01-06 22:50:05 +01:00
|
|
|
struct diff_filepair *p;
|
2006-04-09 05:17:46 +02:00
|
|
|
unsigned short score; /* to remember the break score */
|
2005-05-24 10:10:48 +02:00
|
|
|
} *rename_src;
|
|
|
|
static int rename_src_nr, rename_src_alloc;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2020-12-11 10:08:46 +01:00
|
|
|
static void register_rename_src(struct diff_filepair *p)
|
2005-05-24 10:10:48 +02:00
|
|
|
{
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (p->broken_pair) {
|
|
|
|
if (!break_idx) {
|
|
|
|
break_idx = xmalloc(sizeof(*break_idx));
|
|
|
|
strintmap_init(break_idx, -1);
|
|
|
|
}
|
|
|
|
strintmap_set(break_idx, p->one->path, rename_dst_nr);
|
|
|
|
}
|
|
|
|
|
2014-03-03 23:31:54 +01:00
|
|
|
ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
|
2020-12-11 10:08:46 +01:00
|
|
|
rename_src[rename_src_nr].p = p;
|
|
|
|
rename_src[rename_src_nr].score = p->score;
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_src_nr++;
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
|
2007-06-21 13:52:11 +02:00
|
|
|
static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
|
|
|
|
{
|
|
|
|
int src_len = strlen(src->path), dst_len = strlen(dst->path);
|
|
|
|
while (src_len && dst_len) {
|
|
|
|
char c1 = src->path[--src_len];
|
|
|
|
char c2 = dst->path[--dst_len];
|
|
|
|
if (c1 != c2)
|
|
|
|
return 0;
|
|
|
|
if (c1 == '/')
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return (!src_len || src->path[src_len - 1] == '/') &&
|
|
|
|
(!dst_len || dst->path[dst_len - 1] == '/');
|
|
|
|
}
|
|
|
|
|
2005-05-21 11:39:09 +02:00
|
|
|
struct diff_score {
|
2005-05-24 10:10:48 +02:00
|
|
|
int src; /* index in rename_src */
|
|
|
|
int dst; /* index in rename_dst */
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
unsigned short score;
|
|
|
|
short name_score;
|
2005-05-21 11:39:09 +02:00
|
|
|
};
|
|
|
|
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
struct prefetch_options {
|
|
|
|
struct repository *repo;
|
|
|
|
int skip_unmodified;
|
|
|
|
};
|
|
|
|
static void prefetch(void *prefetch_options)
|
|
|
|
{
|
|
|
|
struct prefetch_options *options = prefetch_options;
|
|
|
|
int i;
|
|
|
|
struct oid_array to_fetch = OID_ARRAY_INIT;
|
|
|
|
|
|
|
|
for (i = 0; i < rename_dst_nr; i++) {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (rename_dst[i].p->renamed_pair)
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
/*
|
|
|
|
* The loop in diffcore_rename() will not need these
|
|
|
|
* blobs, so skip prefetching.
|
|
|
|
*/
|
|
|
|
continue; /* already found exact match */
|
|
|
|
diff_add_if_missing(options->repo, &to_fetch,
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
rename_dst[i].p->two);
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
}
|
|
|
|
for (i = 0; i < rename_src_nr; i++) {
|
|
|
|
if (options->skip_unmodified &&
|
|
|
|
diff_unmodified_pair(rename_src[i].p))
|
|
|
|
/*
|
|
|
|
* The loop in diffcore_rename() will not need these
|
|
|
|
* blobs, so skip prefetching.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
diff_add_if_missing(options->repo, &to_fetch,
|
|
|
|
rename_src[i].p->one);
|
|
|
|
}
|
|
|
|
promisor_remote_get_direct(options->repo, to_fetch.oid, to_fetch.nr);
|
|
|
|
oid_array_clear(&to_fetch);
|
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
static int estimate_similarity(struct repository *r,
|
|
|
|
struct diff_filespec *src,
|
2005-05-21 11:39:09 +02:00
|
|
|
struct diff_filespec *dst,
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
int minimum_score,
|
|
|
|
int skip_unmodified)
|
2005-05-21 11:39:09 +02:00
|
|
|
{
|
|
|
|
/* src points at a file that existed in the original tree (or
|
|
|
|
* optionally a file in the destination tree) and dst points
|
|
|
|
* at a newly created file. They may be quite similar, in which
|
|
|
|
* case we want to say src is renamed to dst or src is copied into
|
|
|
|
* dst, and then some edit has been applied to dst.
|
|
|
|
*
|
|
|
|
* Compare them and return how similar they are, representing
|
2005-05-28 00:56:38 +02:00
|
|
|
* the score as an integer between 0 and MAX_SCORE.
|
|
|
|
*
|
|
|
|
* When there is an exact match, it is considered a better
|
|
|
|
* match than anything else; the destination does not even
|
|
|
|
* call into this function in that case.
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
2006-03-13 07:26:34 +01:00
|
|
|
unsigned long max_size, delta_size, base_size, src_copied, literal_added;
|
2005-05-21 11:39:09 +02:00
|
|
|
int score;
|
2020-04-08 00:11:41 +02:00
|
|
|
struct diff_populate_filespec_options dpf_options = {
|
|
|
|
.check_size_only = 1
|
|
|
|
};
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
struct prefetch_options prefetch_options = {r, skip_unmodified};
|
|
|
|
|
|
|
|
if (r == the_repository && has_promisor_remote()) {
|
|
|
|
dpf_options.missing_object_cb = prefetch;
|
|
|
|
dpf_options.missing_object_data = &prefetch_options;
|
|
|
|
}
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-23 06:24:49 +02:00
|
|
|
/* We deal only with regular files. Symlink renames are handled
|
|
|
|
* only when they are exact matches --- in other words, no edits
|
|
|
|
* after renaming.
|
|
|
|
*/
|
|
|
|
if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
|
|
|
|
return 0;
|
|
|
|
|
2007-10-27 01:51:28 +02:00
|
|
|
/*
|
|
|
|
* Need to check that source and destination sizes are
|
|
|
|
* filled in before comparing them.
|
|
|
|
*
|
|
|
|
* If we already have "cnt_data" filled in, we know it's
|
|
|
|
* all good (avoid checking the size for zero, as that
|
|
|
|
* is a possible size - we really should have a flag to
|
|
|
|
* say whether the size is valid or not!)
|
|
|
|
*/
|
2014-08-16 05:08:04 +02:00
|
|
|
if (!src->cnt_data &&
|
2020-04-08 00:11:41 +02:00
|
|
|
diff_populate_filespec(r, src, &dpf_options))
|
2007-10-27 01:51:28 +02:00
|
|
|
return 0;
|
2014-08-16 05:08:04 +02:00
|
|
|
if (!dst->cnt_data &&
|
2020-04-08 00:11:41 +02:00
|
|
|
diff_populate_filespec(r, dst, &dpf_options))
|
2007-10-27 01:51:28 +02:00
|
|
|
return 0;
|
|
|
|
|
2006-03-13 07:26:34 +01:00
|
|
|
max_size = ((src->size > dst->size) ? src->size : dst->size);
|
2005-05-22 00:55:18 +02:00
|
|
|
base_size = ((src->size < dst->size) ? src->size : dst->size);
|
2006-03-13 07:26:34 +01:00
|
|
|
delta_size = max_size - base_size;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-22 00:55:18 +02:00
|
|
|
/* We would not consider edits that change the file size so
|
|
|
|
* drastically. delta_size must be smaller than
|
2005-05-22 10:31:28 +02:00
|
|
|
* (MAX_SCORE-minimum_score)/MAX_SCORE * min(src->size, dst->size).
|
2005-05-28 00:56:38 +02:00
|
|
|
*
|
2005-05-22 00:55:18 +02:00
|
|
|
* Note that base_size == 0 case is handled here already
|
|
|
|
* and the final score computation below would not have a
|
|
|
|
* divide-by-zero issue.
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
2011-02-19 05:12:06 +01:00
|
|
|
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
|
2005-05-21 11:39:09 +02:00
|
|
|
return 0;
|
|
|
|
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
dpf_options.check_size_only = 0;
|
|
|
|
|
|
|
|
if (!src->cnt_data && diff_populate_filespec(r, src, &dpf_options))
|
2009-01-20 16:59:57 +01:00
|
|
|
return 0;
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
if (!dst->cnt_data && diff_populate_filespec(r, dst, &dpf_options))
|
2009-01-20 16:59:57 +01:00
|
|
|
return 0;
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
if (diffcore_count_changes(r, src, dst,
|
2006-03-12 12:22:10 +01:00
|
|
|
&src->cnt_data, &dst->cnt_data,
|
2006-03-01 01:01:36 +01:00
|
|
|
&src_copied, &literal_added))
|
2005-05-24 21:09:32 +02:00
|
|
|
return 0;
|
2005-06-03 10:36:03 +02:00
|
|
|
|
2006-03-03 07:11:25 +01:00
|
|
|
/* How similar are they?
|
|
|
|
* what percentage of material in dst are from source?
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
2006-03-13 07:26:34 +01:00
|
|
|
if (!dst->size)
|
2006-03-03 07:11:25 +01:00
|
|
|
score = 0; /* should not happen */
|
2007-06-25 00:23:28 +02:00
|
|
|
else
|
2007-03-07 02:44:37 +01:00
|
|
|
score = (int)(src_copied * MAX_SCORE / max_size);
|
2005-05-21 11:39:09 +02:00
|
|
|
return score;
|
|
|
|
}
|
|
|
|
|
2005-09-16 01:13:43 +02:00
|
|
|
static void record_rename_pair(int dst_index, int src_index, int score)
|
2005-05-21 11:39:09 +02:00
|
|
|
{
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filepair *src = rename_src[src_index].p;
|
|
|
|
struct diff_filepair *dst = rename_dst[dst_index].p;
|
[PATCH] Rename/copy detection fix.
The rename/copy detection logic in earlier round was only good
enough to show patch output and discussion on the mailing list
about the diff-raw format updates revealed many problems with
it. This patch fixes all the ones known to me, without making
things I want to do later impossible, mostly related to patch
reordering.
(1) Earlier rename/copy detector determined which one is rename
and which one is copy too early, which made it impossible
to later introduce diffcore transformers to reorder
patches. This patch fixes it by moving that logic to the
very end of the processing.
(2) Earlier output routine diff_flush() was pruning all the
"no-change" entries indiscriminatingly. This was done due
to my false assumption that one of the requirements in the
diff-raw output was not to show such an entry (which
resulted in my incorrect comment about "diff-helper never
being able to be equivalent to built-in diff driver"). My
special thanks go to Linus for correcting me about this.
When we produce diff-raw output, for the downstream to be
able to tell renames from copies, sometimes it _is_
necessary to output "no-change" entries, and this patch
adds diffcore_prune() function for doing it.
(3) Earlier diff_filepair structure was trying to be not too
specific about rename/copy operations, but the purpose of
the structure was to record one or two paths, which _was_
indeed about rename/copy. This patch discards xfrm_msg
field which was trying to be generic for this wrong reason,
and introduces a couple of fields (rename_score and
rename_rank) that are explicitly specific to rename/copy
logic. One thing to note is that the information in a
single diff_filepair structure _still_ does not distinguish
renames from copies, and it is deliberately so. This is to
allow patches to be reordered in later stages.
(4) This patch also adds some tests about diff-raw format
output and makes sure that necessary "no-change" entries
appear on the output.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-23 06:26:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (dst->renamed_pair)
|
2005-05-24 10:10:48 +02:00
|
|
|
die("internal error: dst already matched.");
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
src->one->rename_used++;
|
|
|
|
src->one->count++;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
rename_dst[dst_index].filespec_to_free = dst->one;
|
|
|
|
rename_dst[dst_index].is_rename = 1;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
dst->one = src->one;
|
|
|
|
dst->renamed_pair = 1;
|
|
|
|
if (!strcmp(dst->one->path, dst->two->path))
|
|
|
|
dst->score = rename_src[src_index].score;
|
2006-04-09 05:17:46 +02:00
|
|
|
else
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
dst->score = score;
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We sort the rename similarity matrix with the score, in descending
|
2005-05-28 00:55:55 +02:00
|
|
|
* order (the most similar first).
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
|
|
|
static int score_compare(const void *a_, const void *b_)
|
|
|
|
{
|
|
|
|
const struct diff_score *a = a_, *b = b_;
|
2007-06-25 00:23:28 +02:00
|
|
|
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
/* sink the unused ones to the bottom */
|
|
|
|
if (a->dst < 0)
|
|
|
|
return (0 <= b->dst);
|
|
|
|
else if (b->dst < 0)
|
|
|
|
return -1;
|
|
|
|
|
2007-06-25 00:23:28 +02:00
|
|
|
if (a->score == b->score)
|
|
|
|
return b->name_score - a->name_score;
|
|
|
|
|
2005-05-21 11:39:09 +02:00
|
|
|
return b->score - a->score;
|
|
|
|
}
|
|
|
|
|
2007-10-25 20:23:26 +02:00
|
|
|
struct file_similarity {
|
2013-11-14 20:20:26 +01:00
|
|
|
struct hashmap_entry entry;
|
2013-11-14 20:19:34 +01:00
|
|
|
int index;
|
2007-10-25 20:23:26 +02:00
|
|
|
struct diff_filespec *filespec;
|
|
|
|
};
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
static unsigned int hash_filespec(struct repository *r,
|
|
|
|
struct diff_filespec *filespec)
|
2013-11-14 20:19:04 +01:00
|
|
|
{
|
2016-06-25 01:09:24 +02:00
|
|
|
if (!filespec->oid_valid) {
|
2020-04-08 00:11:41 +02:00
|
|
|
if (diff_populate_filespec(r, filespec, NULL))
|
2013-11-14 20:19:04 +01:00
|
|
|
return 0;
|
2020-01-30 21:32:22 +01:00
|
|
|
hash_object_file(r->hash_algo, filespec->data, filespec->size,
|
|
|
|
"blob", &filespec->oid);
|
2013-11-14 20:19:04 +01:00
|
|
|
}
|
2019-06-20 09:41:49 +02:00
|
|
|
return oidhash(&filespec->oid);
|
2013-11-14 20:19:04 +01:00
|
|
|
}
|
|
|
|
|
2013-11-14 20:20:26 +01:00
|
|
|
static int find_identical_files(struct hashmap *srcs,
|
2013-11-14 20:19:34 +01:00
|
|
|
int dst_index,
|
2011-02-19 04:55:19 +01:00
|
|
|
struct diff_options *options)
|
2007-10-25 20:23:26 +02:00
|
|
|
{
|
|
|
|
int renames = 0;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filespec *target = rename_dst[dst_index].p->two;
|
2014-07-03 00:22:11 +02:00
|
|
|
struct file_similarity *p, *best = NULL;
|
2013-11-14 20:19:04 +01:00
|
|
|
int i = 100, best_score = -1;
|
2019-10-07 01:30:35 +02:00
|
|
|
unsigned int hash = hash_filespec(options->repo, target);
|
2013-11-14 20:19:04 +01:00
|
|
|
|
|
|
|
/*
|
2013-11-14 20:19:34 +01:00
|
|
|
* Find the best source match for specified destination.
|
2013-11-14 20:19:04 +01:00
|
|
|
*/
|
2019-10-07 01:30:35 +02:00
|
|
|
p = hashmap_get_entry_from_hash(srcs, hash, NULL,
|
|
|
|
struct file_similarity, entry);
|
2019-10-07 01:30:41 +02:00
|
|
|
hashmap_for_each_entry_from(srcs, p, entry) {
|
2013-11-14 20:19:04 +01:00
|
|
|
int score;
|
|
|
|
struct diff_filespec *source = p->filespec;
|
|
|
|
|
|
|
|
/* False hash collision? */
|
2018-08-28 23:22:48 +02:00
|
|
|
if (!oideq(&source->oid, &target->oid))
|
2013-11-14 20:19:04 +01:00
|
|
|
continue;
|
|
|
|
/* Non-regular files? If so, the modes must match! */
|
|
|
|
if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
|
|
|
|
if (source->mode != target->mode)
|
2011-02-19 05:10:32 +01:00
|
|
|
continue;
|
2007-10-25 20:23:26 +02:00
|
|
|
}
|
2013-11-14 20:19:04 +01:00
|
|
|
/* Give higher scores to sources that haven't been used already */
|
|
|
|
score = !source->rename_used;
|
|
|
|
if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
|
|
|
|
continue;
|
|
|
|
score += basename_same(source, target);
|
|
|
|
if (score > best_score) {
|
|
|
|
best = p;
|
|
|
|
best_score = score;
|
|
|
|
if (score == 2)
|
|
|
|
break;
|
2007-10-25 20:23:26 +02:00
|
|
|
}
|
2013-11-14 20:19:04 +01:00
|
|
|
|
|
|
|
/* Too many identical alternatives? Pick one */
|
|
|
|
if (!--i)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (best) {
|
2013-11-14 20:19:34 +01:00
|
|
|
record_rename_pair(dst_index, best->index, MAX_SCORE);
|
2013-11-14 20:19:04 +01:00
|
|
|
renames++;
|
|
|
|
}
|
2007-10-25 20:23:26 +02:00
|
|
|
return renames;
|
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
static void insert_file_table(struct repository *r,
|
|
|
|
struct hashmap *table, int index,
|
|
|
|
struct diff_filespec *filespec)
|
2007-10-25 20:23:26 +02:00
|
|
|
{
|
|
|
|
struct file_similarity *entry = xmalloc(sizeof(*entry));
|
|
|
|
|
|
|
|
entry->index = index;
|
|
|
|
entry->filespec = filespec;
|
|
|
|
|
2019-10-07 01:30:27 +02:00
|
|
|
hashmap_entry_init(&entry->entry, hash_filespec(r, filespec));
|
2019-10-07 01:30:29 +02:00
|
|
|
hashmap_add(table, &entry->entry);
|
2007-10-25 20:23:26 +02:00
|
|
|
}
|
|
|
|
|
2007-10-25 20:17:55 +02:00
|
|
|
/*
|
|
|
|
* Find exact renames first.
|
|
|
|
*
|
|
|
|
* The first round matches up the up-to-date entries,
|
|
|
|
* and then during the second round we try to match
|
|
|
|
* cache-dirty entries as well.
|
|
|
|
*/
|
2011-02-19 04:55:19 +01:00
|
|
|
static int find_exact_renames(struct diff_options *options)
|
2007-10-25 20:17:55 +02:00
|
|
|
{
|
2013-11-14 20:19:34 +01:00
|
|
|
int i, renames = 0;
|
2013-11-14 20:20:26 +01:00
|
|
|
struct hashmap file_table;
|
2007-10-25 20:17:55 +02:00
|
|
|
|
diffcore: fix iteration order of identical files during rename detection
If the two paths 'dir/A/file' and 'dir/B/file' have identical content
and the parent directory is renamed, e.g. 'git mv dir other-dir', then
diffcore reports the following exact renames:
renamed: dir/B/file -> other-dir/A/file
renamed: dir/A/file -> other-dir/B/file
While technically not wrong, this is confusing not only for the user,
but also for git commands that make decisions based on rename
information, e.g. 'git log --follow other-dir/A/file' follows
'dir/B/file' past the rename.
This behavior is a side effect of commit v2.0.0-rc4~8^2~14
(diffcore-rename.c: simplify finding exact renames, 2013-11-14): the
hashmap storing sources returns entries from the same bucket, i.e.
sources matching the current destination, in LIFO order. Thus the
iteration first examines 'other-dir/A/file' and 'dir/B/file' and, upon
finding identical content and basename, reports an exact rename.
Other hashmap users are apparently happy with the current iteration
order over the entries of a bucket. Changing the iteration order
would risk upsetting other hashmap users and would increase the memory
footprint of each bucket by a pointer to the tail element.
Fill the hashmap with source entries in reverse order to restore the
original exact rename detection behavior.
Reported-by: Bill Okara <billokara@gmail.com>
Signed-off-by: SZEDER Gábor <szeder@ira.uka.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-03-30 10:35:07 +02:00
|
|
|
/* Add all sources to the hash table in reverse order, because
|
|
|
|
* later on they will be retrieved in LIFO order.
|
|
|
|
*/
|
2017-06-30 21:14:05 +02:00
|
|
|
hashmap_init(&file_table, NULL, NULL, rename_src_nr);
|
diffcore: fix iteration order of identical files during rename detection
If the two paths 'dir/A/file' and 'dir/B/file' have identical content
and the parent directory is renamed, e.g. 'git mv dir other-dir', then
diffcore reports the following exact renames:
renamed: dir/B/file -> other-dir/A/file
renamed: dir/A/file -> other-dir/B/file
While technically not wrong, this is confusing not only for the user,
but also for git commands that make decisions based on rename
information, e.g. 'git log --follow other-dir/A/file' follows
'dir/B/file' past the rename.
This behavior is a side effect of commit v2.0.0-rc4~8^2~14
(diffcore-rename.c: simplify finding exact renames, 2013-11-14): the
hashmap storing sources returns entries from the same bucket, i.e.
sources matching the current destination, in LIFO order. Thus the
iteration first examines 'other-dir/A/file' and 'dir/B/file' and, upon
finding identical content and basename, reports an exact rename.
Other hashmap users are apparently happy with the current iteration
order over the entries of a bucket. Changing the iteration order
would risk upsetting other hashmap users and would increase the memory
footprint of each bucket by a pointer to the tail element.
Fill the hashmap with source entries in reverse order to restore the
original exact rename detection behavior.
Reported-by: Bill Okara <billokara@gmail.com>
Signed-off-by: SZEDER Gábor <szeder@ira.uka.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-03-30 10:35:07 +02:00
|
|
|
for (i = rename_src_nr-1; i >= 0; i--)
|
2018-09-21 17:57:19 +02:00
|
|
|
insert_file_table(options->repo,
|
|
|
|
&file_table, i,
|
|
|
|
rename_src[i].p->one);
|
2007-10-25 20:23:26 +02:00
|
|
|
|
2013-11-14 20:19:34 +01:00
|
|
|
/* Walk the destinations and find best source match */
|
2007-10-25 20:23:26 +02:00
|
|
|
for (i = 0; i < rename_dst_nr; i++)
|
2013-11-14 20:19:34 +01:00
|
|
|
renames += find_identical_files(&file_table, i, options);
|
2007-10-25 20:23:26 +02:00
|
|
|
|
2013-11-14 20:20:26 +01:00
|
|
|
/* Free the hash data structure and entries */
|
2020-11-02 19:55:05 +01:00
|
|
|
hashmap_clear_and_free(&file_table, struct file_similarity, entry);
|
2007-10-25 20:23:26 +02:00
|
|
|
|
2013-11-14 20:19:34 +01:00
|
|
|
return renames;
|
2007-10-25 20:17:55 +02:00
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:40 +01:00
|
|
|
struct dir_rename_info {
|
|
|
|
struct strintmap idx_map;
|
|
|
|
struct strmap dir_rename_guess;
|
|
|
|
struct strmap *dir_rename_count;
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *relevant_source_dirs;
|
2021-02-27 01:30:40 +01:00
|
|
|
unsigned setup;
|
|
|
|
};
|
|
|
|
|
|
|
|
static char *get_dirname(const char *filename)
|
|
|
|
{
|
|
|
|
char *slash = strrchr(filename, '/');
|
|
|
|
return slash ? xstrndup(filename, slash - filename) : xstrdup("");
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:42 +01:00
|
|
|
static void dirname_munge(char *filename)
|
|
|
|
{
|
|
|
|
char *slash = strrchr(filename, '/');
|
|
|
|
if (!slash)
|
|
|
|
slash = filename;
|
|
|
|
*slash = '\0';
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:48 +01:00
|
|
|
static const char *get_highest_rename_path(struct strintmap *counts)
|
|
|
|
{
|
|
|
|
int highest_count = 0;
|
|
|
|
const char *highest_destination_dir = NULL;
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
|
|
|
|
strintmap_for_each_entry(counts, &iter, entry) {
|
|
|
|
const char *destination_dir = entry->key;
|
|
|
|
intptr_t count = (intptr_t)entry->value;
|
|
|
|
if (count > highest_count) {
|
|
|
|
highest_count = count;
|
|
|
|
highest_destination_dir = destination_dir;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return highest_destination_dir;
|
|
|
|
}
|
|
|
|
|
2021-03-13 23:22:05 +01:00
|
|
|
static char *UNKNOWN_DIR = "/"; /* placeholder -- short, illegal directory */
|
|
|
|
|
|
|
|
static int dir_rename_already_determinable(struct strintmap *counts)
|
|
|
|
{
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
int first = 0, second = 0, unknown = 0;
|
|
|
|
strintmap_for_each_entry(counts, &iter, entry) {
|
|
|
|
const char *destination_dir = entry->key;
|
|
|
|
intptr_t count = (intptr_t)entry->value;
|
|
|
|
if (!strcmp(destination_dir, UNKNOWN_DIR)) {
|
|
|
|
unknown = count;
|
|
|
|
} else if (count >= first) {
|
|
|
|
second = first;
|
|
|
|
first = count;
|
|
|
|
} else if (count >= second) {
|
|
|
|
second = count;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return first > second + unknown;
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:44 +01:00
|
|
|
static void increment_count(struct dir_rename_info *info,
|
2021-02-27 01:30:42 +01:00
|
|
|
char *old_dir,
|
|
|
|
char *new_dir)
|
|
|
|
{
|
|
|
|
struct strintmap *counts;
|
|
|
|
struct strmap_entry *e;
|
|
|
|
|
|
|
|
/* Get the {new_dirs -> counts} mapping using old_dir */
|
2021-02-27 01:30:44 +01:00
|
|
|
e = strmap_get_entry(info->dir_rename_count, old_dir);
|
2021-02-27 01:30:42 +01:00
|
|
|
if (e) {
|
|
|
|
counts = e->value;
|
|
|
|
} else {
|
|
|
|
counts = xmalloc(sizeof(*counts));
|
|
|
|
strintmap_init_with_options(counts, 0, NULL, 1);
|
2021-02-27 01:30:44 +01:00
|
|
|
strmap_put(info->dir_rename_count, old_dir, counts);
|
2021-02-27 01:30:42 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Increment the count for new_dir */
|
|
|
|
strintmap_incr(counts, new_dir, 1);
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:44 +01:00
|
|
|
static void update_dir_rename_counts(struct dir_rename_info *info,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *dirs_removed,
|
2021-02-27 01:30:42 +01:00
|
|
|
const char *oldname,
|
|
|
|
const char *newname)
|
|
|
|
{
|
|
|
|
char *old_dir = xstrdup(oldname);
|
|
|
|
char *new_dir = xstrdup(newname);
|
|
|
|
char new_dir_first_char = new_dir[0];
|
|
|
|
int first_time_in_loop = 1;
|
|
|
|
|
2021-02-27 01:30:46 +01:00
|
|
|
if (!info->setup)
|
|
|
|
/*
|
|
|
|
* info->setup is 0 here in two cases: (1) all auxiliary
|
|
|
|
* vars (like dirs_removed) were NULL so
|
|
|
|
* initialize_dir_rename_info() returned early, or (2)
|
|
|
|
* either break detection or copy detection are active so
|
|
|
|
* that we never called initialize_dir_rename_info(). In
|
|
|
|
* the former case, we don't have enough info to know if
|
|
|
|
* directories were renamed (because dirs_removed lets us
|
|
|
|
* know about a necessary prerequisite, namely if they were
|
|
|
|
* removed), and in the latter, we don't care about
|
|
|
|
* directory renames or find_basename_matches.
|
|
|
|
*
|
|
|
|
* This matters because both basename and inexact matching
|
|
|
|
* will also call update_dir_rename_counts(). In either of
|
|
|
|
* the above two cases info->dir_rename_counts will not
|
|
|
|
* have been properly initialized which prevents us from
|
|
|
|
* updating it, but in these two cases we don't care about
|
|
|
|
* dir_rename_counts anyway, so we can just exit early.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
|
2021-02-27 01:30:42 +01:00
|
|
|
while (1) {
|
diffcore-rename: only compute dir_rename_count for relevant directories
When one side adds files to a directory that the other side renamed,
directory rename detection is used to either move the new paths to the
newer directory or warn the user about the fact that another path
location might be better.
If a parent of the given directory had new files added to it, any
renames in the current directory are also part of determining where the
parent directory is renamed to. Thus, naively, we need to record each
rename N times for a path at depth N. However, we can use the
additional information added to dirs_removed in the last commit to avoid
traversing all N parent directories in many cases. Let's use an example
to explain how this works. If we have a path named
src/old_dir/a/b/file.c
and src/old_dir doesn't exist on one side of history, but the other
added a file named src/old_dir/newfile.c, then if one side renamed
src/old_dir/a/b/file.c => source/new_dir/a/b/file.c
then this file would affect potential directory rename detection counts
for
src/old_dir/a/b => source/new_dir/a/b
src/old_dir/a => source/new_dir/a
src/old_dir => source/new_dir
src => source
adding a weight of 1 to each in dir_rename_counts. However, if src/
exists on both sides of history, then we don't need to track any entries
for it in dir_rename_counts. That was implemented previously. What we
are adding now, is that if no new files were added to src/old_dir/a or
src/old_dir/b, then we don't need to have counts in dir_rename_count
for those directories either.
In short, we only need to track counts in dir_rename_count for
directories whose dirs_removed value is RELEVANT_FOR_SELF. And as soon
as we reach a directory that isn't in dirs_removed (signalled by
returning the default value of NOT_RELEVANT from strintmap_get()), we
can stop looking any further up the directory hierarchy.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:04 +01:00
|
|
|
int drd_flag = NOT_RELEVANT;
|
|
|
|
|
2021-02-27 01:30:47 +01:00
|
|
|
/* Get old_dir, skip if its directory isn't relevant. */
|
2021-02-27 01:30:42 +01:00
|
|
|
dirname_munge(old_dir);
|
2021-02-27 01:30:47 +01:00
|
|
|
if (info->relevant_source_dirs &&
|
2021-03-13 23:22:02 +01:00
|
|
|
!strintmap_contains(info->relevant_source_dirs, old_dir))
|
2021-02-27 01:30:47 +01:00
|
|
|
break;
|
|
|
|
|
|
|
|
/* Get new_dir */
|
2021-02-27 01:30:42 +01:00
|
|
|
dirname_munge(new_dir);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When renaming
|
|
|
|
* "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
|
|
|
|
* then this suggests that both
|
|
|
|
* a/b/c/d/e/ => a/b/some/thing/else/e/
|
|
|
|
* a/b/c/d/ => a/b/some/thing/else/
|
|
|
|
* so we want to increment counters for both. We do NOT,
|
|
|
|
* however, also want to suggest that there was the following
|
|
|
|
* rename:
|
|
|
|
* a/b/c/ => a/b/some/thing/
|
|
|
|
* so we need to quit at that point.
|
|
|
|
*
|
|
|
|
* Note the when first_time_in_loop, we only strip off the
|
|
|
|
* basename, and we don't care if that's different.
|
|
|
|
*/
|
|
|
|
if (!first_time_in_loop) {
|
|
|
|
char *old_sub_dir = strchr(old_dir, '\0')+1;
|
|
|
|
char *new_sub_dir = strchr(new_dir, '\0')+1;
|
|
|
|
if (!*new_dir) {
|
|
|
|
/*
|
|
|
|
* Special case when renaming to root directory,
|
|
|
|
* i.e. when new_dir == "". In this case, we had
|
|
|
|
* something like
|
|
|
|
* a/b/subdir => subdir
|
|
|
|
* and so dirname_munge() sets things up so that
|
|
|
|
* old_dir = "a/b\0subdir\0"
|
|
|
|
* new_dir = "\0ubdir\0"
|
|
|
|
* We didn't have a '/' to overwrite a '\0' onto
|
|
|
|
* in new_dir, so we have to compare differently.
|
|
|
|
*/
|
|
|
|
if (new_dir_first_char != old_sub_dir[0] ||
|
|
|
|
strcmp(old_sub_dir+1, new_sub_dir))
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
if (strcmp(old_sub_dir, new_sub_dir))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
diffcore-rename: only compute dir_rename_count for relevant directories
When one side adds files to a directory that the other side renamed,
directory rename detection is used to either move the new paths to the
newer directory or warn the user about the fact that another path
location might be better.
If a parent of the given directory had new files added to it, any
renames in the current directory are also part of determining where the
parent directory is renamed to. Thus, naively, we need to record each
rename N times for a path at depth N. However, we can use the
additional information added to dirs_removed in the last commit to avoid
traversing all N parent directories in many cases. Let's use an example
to explain how this works. If we have a path named
src/old_dir/a/b/file.c
and src/old_dir doesn't exist on one side of history, but the other
added a file named src/old_dir/newfile.c, then if one side renamed
src/old_dir/a/b/file.c => source/new_dir/a/b/file.c
then this file would affect potential directory rename detection counts
for
src/old_dir/a/b => source/new_dir/a/b
src/old_dir/a => source/new_dir/a
src/old_dir => source/new_dir
src => source
adding a weight of 1 to each in dir_rename_counts. However, if src/
exists on both sides of history, then we don't need to track any entries
for it in dir_rename_counts. That was implemented previously. What we
are adding now, is that if no new files were added to src/old_dir/a or
src/old_dir/b, then we don't need to have counts in dir_rename_count
for those directories either.
In short, we only need to track counts in dir_rename_count for
directories whose dirs_removed value is RELEVANT_FOR_SELF. And as soon
as we reach a directory that isn't in dirs_removed (signalled by
returning the default value of NOT_RELEVANT from strintmap_get()), we
can stop looking any further up the directory hierarchy.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:04 +01:00
|
|
|
/*
|
|
|
|
* Above we suggested that we'd keep recording renames for
|
|
|
|
* all ancestor directories where the trailing directories
|
|
|
|
* matched, i.e. for
|
|
|
|
* "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
|
|
|
|
* we'd increment rename counts for each of
|
|
|
|
* a/b/c/d/e/ => a/b/some/thing/else/e/
|
|
|
|
* a/b/c/d/ => a/b/some/thing/else/
|
|
|
|
* However, we only need the rename counts for directories
|
|
|
|
* in dirs_removed whose value is RELEVANT_FOR_SELF.
|
|
|
|
* However, we add one special case of also recording it for
|
|
|
|
* first_time_in_loop because find_basename_matches() can
|
|
|
|
* use that as a hint to find a good pairing.
|
|
|
|
*/
|
|
|
|
if (dirs_removed)
|
|
|
|
drd_flag = strintmap_get(dirs_removed, old_dir);
|
|
|
|
if (drd_flag == RELEVANT_FOR_SELF || first_time_in_loop)
|
2021-02-27 01:30:44 +01:00
|
|
|
increment_count(info, old_dir, new_dir);
|
2021-02-27 01:30:42 +01:00
|
|
|
|
diffcore-rename: only compute dir_rename_count for relevant directories
When one side adds files to a directory that the other side renamed,
directory rename detection is used to either move the new paths to the
newer directory or warn the user about the fact that another path
location might be better.
If a parent of the given directory had new files added to it, any
renames in the current directory are also part of determining where the
parent directory is renamed to. Thus, naively, we need to record each
rename N times for a path at depth N. However, we can use the
additional information added to dirs_removed in the last commit to avoid
traversing all N parent directories in many cases. Let's use an example
to explain how this works. If we have a path named
src/old_dir/a/b/file.c
and src/old_dir doesn't exist on one side of history, but the other
added a file named src/old_dir/newfile.c, then if one side renamed
src/old_dir/a/b/file.c => source/new_dir/a/b/file.c
then this file would affect potential directory rename detection counts
for
src/old_dir/a/b => source/new_dir/a/b
src/old_dir/a => source/new_dir/a
src/old_dir => source/new_dir
src => source
adding a weight of 1 to each in dir_rename_counts. However, if src/
exists on both sides of history, then we don't need to track any entries
for it in dir_rename_counts. That was implemented previously. What we
are adding now, is that if no new files were added to src/old_dir/a or
src/old_dir/b, then we don't need to have counts in dir_rename_count
for those directories either.
In short, we only need to track counts in dir_rename_count for
directories whose dirs_removed value is RELEVANT_FOR_SELF. And as soon
as we reach a directory that isn't in dirs_removed (signalled by
returning the default value of NOT_RELEVANT from strintmap_get()), we
can stop looking any further up the directory hierarchy.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:04 +01:00
|
|
|
first_time_in_loop = 0;
|
|
|
|
if (drd_flag == NOT_RELEVANT)
|
|
|
|
break;
|
2021-02-27 01:30:42 +01:00
|
|
|
/* If we hit toplevel directory ("") for old or new dir, quit */
|
|
|
|
if (!*old_dir || !*new_dir)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Free resources we don't need anymore */
|
|
|
|
free(old_dir);
|
|
|
|
free(new_dir);
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:46 +01:00
|
|
|
static void initialize_dir_rename_info(struct dir_rename_info *info,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *relevant_sources,
|
|
|
|
struct strintmap *dirs_removed,
|
2021-02-27 01:30:46 +01:00
|
|
|
struct strmap *dir_rename_count)
|
2021-02-27 01:30:42 +01:00
|
|
|
{
|
2021-02-27 01:30:48 +01:00
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
2021-02-27 01:30:42 +01:00
|
|
|
int i;
|
|
|
|
|
2021-03-11 01:38:31 +01:00
|
|
|
if (!dirs_removed && !relevant_sources) {
|
2021-02-27 01:30:46 +01:00
|
|
|
info->setup = 0;
|
|
|
|
return;
|
2021-02-27 01:30:42 +01:00
|
|
|
}
|
2021-02-27 01:30:41 +01:00
|
|
|
info->setup = 1;
|
|
|
|
|
2021-02-27 01:30:46 +01:00
|
|
|
info->dir_rename_count = dir_rename_count;
|
|
|
|
if (!info->dir_rename_count) {
|
|
|
|
info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
|
|
|
|
strmap_init(info->dir_rename_count);
|
|
|
|
}
|
2021-02-27 01:30:41 +01:00
|
|
|
strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
|
|
|
|
strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
|
|
|
|
|
2021-02-27 01:30:47 +01:00
|
|
|
/* Setup info->relevant_source_dirs */
|
2021-03-11 01:38:31 +01:00
|
|
|
info->relevant_source_dirs = NULL;
|
|
|
|
if (dirs_removed || !relevant_sources) {
|
|
|
|
info->relevant_source_dirs = dirs_removed; /* might be NULL */
|
|
|
|
} else {
|
|
|
|
info->relevant_source_dirs = xmalloc(sizeof(struct strintmap));
|
2021-03-13 23:22:02 +01:00
|
|
|
strintmap_init(info->relevant_source_dirs, 0 /* unused */);
|
|
|
|
strintmap_for_each_entry(relevant_sources, &iter, entry) {
|
2021-03-11 01:38:31 +01:00
|
|
|
char *dirname = get_dirname(entry->key);
|
|
|
|
if (!dirs_removed ||
|
2021-03-13 23:22:02 +01:00
|
|
|
strintmap_contains(dirs_removed, dirname))
|
|
|
|
strintmap_set(info->relevant_source_dirs,
|
|
|
|
dirname, 0 /* value irrelevant */);
|
2021-03-11 01:38:31 +01:00
|
|
|
free(dirname);
|
|
|
|
}
|
|
|
|
}
|
2021-02-27 01:30:47 +01:00
|
|
|
|
2021-02-27 01:30:41 +01:00
|
|
|
/*
|
2021-02-27 01:30:46 +01:00
|
|
|
* Loop setting up both info->idx_map, and doing setup of
|
|
|
|
* info->dir_rename_count.
|
2021-02-27 01:30:41 +01:00
|
|
|
*/
|
|
|
|
for (i = 0; i < rename_dst_nr; ++i) {
|
|
|
|
/*
|
|
|
|
* For non-renamed files, make idx_map contain mapping of
|
|
|
|
* filename -> index (index within rename_dst, that is)
|
|
|
|
*/
|
|
|
|
if (!rename_dst[i].is_rename) {
|
|
|
|
char *filename = rename_dst[i].p->two->path;
|
|
|
|
strintmap_set(&info->idx_map, filename, i);
|
2021-02-27 01:30:46 +01:00
|
|
|
continue;
|
2021-02-27 01:30:41 +01:00
|
|
|
}
|
2021-02-27 01:30:46 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For everything else (i.e. renamed files), make
|
|
|
|
* dir_rename_count contain a map of a map:
|
|
|
|
* old_directory -> {new_directory -> count}
|
|
|
|
* In other words, for every pair look at the directories for
|
|
|
|
* the old filename and the new filename and count how many
|
|
|
|
* times that pairing occurs.
|
|
|
|
*/
|
|
|
|
update_dir_rename_counts(info, dirs_removed,
|
|
|
|
rename_dst[i].p->one->path,
|
|
|
|
rename_dst[i].p->two->path);
|
2021-02-27 01:30:41 +01:00
|
|
|
}
|
2021-02-27 01:30:48 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Now we collapse
|
|
|
|
* dir_rename_count: old_directory -> {new_directory -> count}
|
|
|
|
* down to
|
|
|
|
* dir_rename_guess: old_directory -> best_new_directory
|
|
|
|
* where best_new_directory is the one with the highest count.
|
|
|
|
*/
|
|
|
|
strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
|
|
|
|
/* entry->key is source_dir */
|
|
|
|
struct strintmap *counts = entry->value;
|
|
|
|
char *best_newdir;
|
|
|
|
|
|
|
|
best_newdir = xstrdup(get_highest_rename_path(counts));
|
|
|
|
strmap_put(&info->dir_rename_guess, entry->key,
|
|
|
|
best_newdir);
|
|
|
|
}
|
2021-02-27 01:30:41 +01:00
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:43 +01:00
|
|
|
void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
|
|
|
|
{
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
|
|
|
|
strmap_for_each_entry(dir_rename_count, &iter, entry) {
|
|
|
|
struct strintmap *counts = entry->value;
|
|
|
|
strintmap_clear(counts);
|
|
|
|
}
|
|
|
|
strmap_partial_clear(dir_rename_count, 1);
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:45 +01:00
|
|
|
static void cleanup_dir_rename_info(struct dir_rename_info *info,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *dirs_removed,
|
2021-02-27 01:30:45 +01:00
|
|
|
int keep_dir_rename_count)
|
2021-02-27 01:30:41 +01:00
|
|
|
{
|
2021-02-27 01:30:45 +01:00
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
struct string_list to_remove = STRING_LIST_INIT_NODUP;
|
|
|
|
int i;
|
|
|
|
|
2021-02-27 01:30:41 +01:00
|
|
|
if (!info->setup)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* idx_map */
|
|
|
|
strintmap_clear(&info->idx_map);
|
|
|
|
|
|
|
|
/* dir_rename_guess */
|
|
|
|
strmap_clear(&info->dir_rename_guess, 1);
|
|
|
|
|
2021-03-11 01:38:31 +01:00
|
|
|
/* relevant_source_dirs */
|
|
|
|
if (info->relevant_source_dirs &&
|
|
|
|
info->relevant_source_dirs != dirs_removed) {
|
2021-03-13 23:22:02 +01:00
|
|
|
strintmap_clear(info->relevant_source_dirs);
|
2021-03-11 01:38:31 +01:00
|
|
|
FREE_AND_NULL(info->relevant_source_dirs);
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:44 +01:00
|
|
|
/* dir_rename_count */
|
2021-02-27 01:30:45 +01:00
|
|
|
if (!keep_dir_rename_count) {
|
|
|
|
partial_clear_dir_rename_count(info->dir_rename_count);
|
|
|
|
strmap_clear(info->dir_rename_count, 1);
|
|
|
|
FREE_AND_NULL(info->dir_rename_count);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Although dir_rename_count was passed in
|
|
|
|
* diffcore_rename_extended() and we want to keep it around and
|
2021-03-13 23:22:06 +01:00
|
|
|
* return it to that caller, we first want to remove any counts in
|
|
|
|
* the maps associated with UNKNOWN_DIR entries and any data
|
2021-02-27 01:30:45 +01:00
|
|
|
* associated with directories that weren't renamed.
|
|
|
|
*/
|
|
|
|
strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
|
|
|
|
const char *source_dir = entry->key;
|
|
|
|
struct strintmap *counts = entry->value;
|
|
|
|
|
2021-03-13 23:22:03 +01:00
|
|
|
if (!strintmap_get(dirs_removed, source_dir)) {
|
2021-02-27 01:30:45 +01:00
|
|
|
string_list_append(&to_remove, source_dir);
|
|
|
|
strintmap_clear(counts);
|
|
|
|
continue;
|
|
|
|
}
|
2021-03-13 23:22:06 +01:00
|
|
|
|
|
|
|
if (strintmap_contains(counts, UNKNOWN_DIR))
|
|
|
|
strintmap_remove(counts, UNKNOWN_DIR);
|
2021-02-27 01:30:45 +01:00
|
|
|
}
|
|
|
|
for (i = 0; i < to_remove.nr; ++i)
|
|
|
|
strmap_remove(info->dir_rename_count,
|
|
|
|
to_remove.items[i].string, 1);
|
|
|
|
string_list_clear(&to_remove, 0);
|
2021-02-27 01:30:41 +01:00
|
|
|
}
|
|
|
|
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
static const char *get_basename(const char *filename)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* gitbasename() has to worry about special drives, multiple
|
|
|
|
* directory separator characters, trailing slashes, NULL or
|
|
|
|
* empty strings, etc. We only work on filenames as stored in
|
|
|
|
* git, and thus get to ignore all those complications.
|
|
|
|
*/
|
|
|
|
const char *base = strrchr(filename, '/');
|
|
|
|
return base ? base + 1 : filename;
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:40 +01:00
|
|
|
static int idx_possible_rename(char *filename, struct dir_rename_info *info)
|
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same. The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.
Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique. Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name. (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to. Such a task is precisely the job of directory rename
detection. However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete. The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames. In some cases we won't have
exact renames, in which case this extra optimization won't apply. We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended(). Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename. For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27 01:30:39 +01:00
|
|
|
{
|
2021-02-27 01:30:40 +01:00
|
|
|
/*
|
|
|
|
* Our comparison of files with the same basename (see
|
|
|
|
* find_basename_matches() below), is only helpful when after exact
|
|
|
|
* rename detection we have exactly one file with a given basename
|
|
|
|
* among the rename sources and also only exactly one file with
|
|
|
|
* that basename among the rename destinations. When we have
|
|
|
|
* multiple files with the same basename in either set, we do not
|
|
|
|
* know which to compare against. However, there are some
|
|
|
|
* filenames that occur in large numbers (particularly
|
|
|
|
* build-related filenames such as 'Makefile', '.gitignore', or
|
|
|
|
* 'build.gradle' that potentially exist within every single
|
|
|
|
* subdirectory), and for performance we want to be able to quickly
|
|
|
|
* find renames for these files too.
|
|
|
|
*
|
|
|
|
* The reason basename comparisons are a useful heuristic was that it
|
|
|
|
* is common for people to move files across directories while keeping
|
|
|
|
* their filename the same. If we had a way of determining or even
|
|
|
|
* making a good educated guess about which directory these non-unique
|
|
|
|
* basename files had moved the file to, we could check it.
|
|
|
|
* Luckily...
|
|
|
|
*
|
|
|
|
* When an entire directory is in fact renamed, we have two factors
|
|
|
|
* helping us out:
|
|
|
|
* (a) the original directory disappeared giving us a hint
|
|
|
|
* about when we can apply an extra heuristic.
|
|
|
|
* (a) we often have several files within that directory and
|
|
|
|
* subdirectories that are renamed without changes
|
|
|
|
* So, rules for a heuristic:
|
|
|
|
* (0) If there basename matches are non-unique (the condition under
|
|
|
|
* which this function is called) AND
|
|
|
|
* (1) the directory in which the file was found has disappeared
|
|
|
|
* (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
|
|
|
|
* (2) use exact renames of files within the directory to determine
|
|
|
|
* where the directory is likely to have been renamed to. IF
|
|
|
|
* there is at least one exact rename from within that
|
|
|
|
* directory, we can proceed.
|
|
|
|
* (3) If there are multiple places the directory could have been
|
|
|
|
* renamed to based on exact renames, ignore all but one of them.
|
|
|
|
* Just use the destination with the most renames going to it.
|
|
|
|
* (4) Check if applying that directory rename to the original file
|
|
|
|
* would result in a destination filename that is in the
|
|
|
|
* potential rename set. If so, return the index of the
|
|
|
|
* destination file (the index within rename_dst).
|
|
|
|
* (5) Compare the original file and returned destination for
|
|
|
|
* similarity, and if they are sufficiently similar, record the
|
|
|
|
* rename.
|
|
|
|
*
|
|
|
|
* This function, idx_possible_rename(), is only responsible for (4).
|
2021-02-27 01:30:48 +01:00
|
|
|
* The conditions/steps in (1)-(3) are handled via setting up
|
|
|
|
* dir_rename_count and dir_rename_guess in
|
|
|
|
* initialize_dir_rename_info(). Steps (0) and (5) are handled by
|
|
|
|
* the caller of this function.
|
2021-02-27 01:30:40 +01:00
|
|
|
*/
|
|
|
|
char *old_dir, *new_dir;
|
|
|
|
struct strbuf new_path = STRBUF_INIT;
|
|
|
|
int idx;
|
|
|
|
|
|
|
|
if (!info->setup)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
old_dir = get_dirname(filename);
|
|
|
|
new_dir = strmap_get(&info->dir_rename_guess, old_dir);
|
|
|
|
free(old_dir);
|
|
|
|
if (!new_dir)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
strbuf_addstr(&new_path, new_dir);
|
|
|
|
strbuf_addch(&new_path, '/');
|
|
|
|
strbuf_addstr(&new_path, get_basename(filename));
|
|
|
|
|
|
|
|
idx = strintmap_get(&info->idx_map, new_path.buf);
|
|
|
|
strbuf_release(&new_path);
|
|
|
|
return idx;
|
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same. The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.
Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique. Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name. (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to. Such a task is precisely the job of directory rename
detection. However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete. The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames. In some cases we won't have
exact renames, in which case this extra optimization won't apply. We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended(). Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename. For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27 01:30:39 +01:00
|
|
|
}
|
|
|
|
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
static int find_basename_matches(struct diff_options *options,
|
2021-02-27 01:30:40 +01:00
|
|
|
int minimum_score,
|
2021-02-27 01:30:46 +01:00
|
|
|
struct dir_rename_info *info,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *relevant_sources,
|
|
|
|
struct strintmap *dirs_removed)
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
{
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
/*
|
|
|
|
* When I checked in early 2020, over 76% of file renames in linux
|
|
|
|
* just moved files to a different directory but kept the same
|
|
|
|
* basename. gcc did that with over 64% of renames, gecko did it
|
|
|
|
* with over 79%, and WebKit did it with over 89%.
|
|
|
|
*
|
|
|
|
* Therefore we can bypass the normal exhaustive NxM matrix
|
|
|
|
* comparison of similarities between all potential rename sources
|
|
|
|
* and destinations by instead using file basename as a hint (i.e.
|
|
|
|
* the portion of the filename after the last '/'), checking for
|
|
|
|
* similarity between files with the same basename, and if we find
|
|
|
|
* a pair that are sufficiently similar, record the rename pair and
|
|
|
|
* exclude those two from the NxM matrix.
|
|
|
|
*
|
|
|
|
* This *might* cause us to find a less than optimal pairing (if
|
|
|
|
* there is another file that we are even more similar to but has a
|
|
|
|
* different basename). Given the huge performance advantage
|
|
|
|
* basename matching provides, and given the frequency with which
|
|
|
|
* people use the same basename in real world projects, that's a
|
|
|
|
* trade-off we are willing to accept when doing just rename
|
|
|
|
* detection.
|
|
|
|
*
|
|
|
|
* If someone wants copy detection that implies they are willing to
|
|
|
|
* spend more cycles to find similarities between files, so it may
|
|
|
|
* be less likely that this heuristic is wanted. If someone is
|
|
|
|
* doing break detection, that means they do not want filename
|
|
|
|
* similarity to imply any form of content similiarity, and thus
|
|
|
|
* this heuristic would definitely be incompatible.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int i, renames = 0;
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
struct strintmap sources;
|
|
|
|
struct strintmap dests;
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The prefeteching stuff wants to know if it can skip prefetching
|
|
|
|
* blobs that are unmodified...and will then do a little extra work
|
|
|
|
* to verify that the oids are indeed different before prefetching.
|
|
|
|
* Unmodified blobs are only relevant when doing copy detection;
|
|
|
|
* when limiting to rename detection, diffcore_rename[_extended]()
|
|
|
|
* will never be called with unmodified source paths fed to us, so
|
|
|
|
* the extra work necessary to check if rename_src entries are
|
|
|
|
* unmodified would be a small waste.
|
|
|
|
*/
|
|
|
|
int skip_unmodified = 0;
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create maps of basename -> fullname(s) for remaining sources and
|
|
|
|
* dests.
|
|
|
|
*/
|
|
|
|
strintmap_init_with_options(&sources, -1, NULL, 0);
|
|
|
|
strintmap_init_with_options(&dests, -1, NULL, 0);
|
|
|
|
for (i = 0; i < rename_src_nr; ++i) {
|
|
|
|
char *filename = rename_src[i].p->one->path;
|
|
|
|
const char *base;
|
|
|
|
|
|
|
|
/* exact renames removed in remove_unneeded_paths_from_src() */
|
|
|
|
assert(!rename_src[i].p->one->rename_used);
|
|
|
|
|
|
|
|
/* Record index within rename_src (i) if basename is unique */
|
|
|
|
base = get_basename(filename);
|
|
|
|
if (strintmap_contains(&sources, base))
|
|
|
|
strintmap_set(&sources, base, -1);
|
|
|
|
else
|
|
|
|
strintmap_set(&sources, base, i);
|
|
|
|
}
|
|
|
|
for (i = 0; i < rename_dst_nr; ++i) {
|
|
|
|
char *filename = rename_dst[i].p->two->path;
|
|
|
|
const char *base;
|
|
|
|
|
|
|
|
if (rename_dst[i].is_rename)
|
|
|
|
continue; /* involved in exact match already. */
|
|
|
|
|
|
|
|
/* Record index within rename_dst (i) if basename is unique */
|
|
|
|
base = get_basename(filename);
|
|
|
|
if (strintmap_contains(&dests, base))
|
|
|
|
strintmap_set(&dests, base, -1);
|
|
|
|
else
|
|
|
|
strintmap_set(&dests, base, i);
|
|
|
|
}
|
|
|
|
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
/* Now look for basename matchups and do similarity estimation */
|
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same. The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.
Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique. Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name. (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to. Such a task is precisely the job of directory rename
detection. However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete. The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames. In some cases we won't have
exact renames, in which case this extra optimization won't apply. We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended(). Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename. For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27 01:30:39 +01:00
|
|
|
for (i = 0; i < rename_src_nr; ++i) {
|
|
|
|
char *filename = rename_src[i].p->one->path;
|
|
|
|
const char *base = NULL;
|
|
|
|
intptr_t src_index;
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
intptr_t dst_index;
|
|
|
|
|
2021-03-11 01:38:31 +01:00
|
|
|
/* Skip irrelevant sources */
|
|
|
|
if (relevant_sources &&
|
2021-03-13 23:22:02 +01:00
|
|
|
!strintmap_contains(relevant_sources, filename))
|
2021-03-11 01:38:31 +01:00
|
|
|
continue;
|
|
|
|
|
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same. The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.
Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique. Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name. (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to. Such a task is precisely the job of directory rename
detection. However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete. The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames. In some cases we won't have
exact renames, in which case this extra optimization won't apply. We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended(). Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename. For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27 01:30:39 +01:00
|
|
|
/*
|
|
|
|
* If the basename is unique among remaining sources, then
|
|
|
|
* src_index will equal 'i' and we can attempt to match it
|
|
|
|
* to a unique basename in the destinations. Otherwise,
|
|
|
|
* use directory rename heuristics, if possible.
|
|
|
|
*/
|
|
|
|
base = get_basename(filename);
|
|
|
|
src_index = strintmap_get(&sources, base);
|
|
|
|
assert(src_index == -1 || src_index == i);
|
|
|
|
|
|
|
|
if (strintmap_contains(&dests, base)) {
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
struct diff_filespec *one, *two;
|
|
|
|
int score;
|
|
|
|
|
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same. The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.
Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique. Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name. (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to. Such a task is precisely the job of directory rename
detection. However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete. The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames. In some cases we won't have
exact renames, in which case this extra optimization won't apply. We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended(). Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename. For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27 01:30:39 +01:00
|
|
|
/* Find a matching destination, if possible */
|
|
|
|
dst_index = strintmap_get(&dests, base);
|
|
|
|
if (src_index == -1 || dst_index == -1) {
|
|
|
|
src_index = i;
|
2021-02-27 01:30:40 +01:00
|
|
|
dst_index = idx_possible_rename(filename, info);
|
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same. The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.
Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique. Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name. (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to. Such a task is precisely the job of directory rename
detection. However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete. The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames. In some cases we won't have
exact renames, in which case this extra optimization won't apply. We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended(). Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename. For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-27 01:30:39 +01:00
|
|
|
}
|
|
|
|
if (dst_index == -1)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Ignore this dest if already used in a rename */
|
|
|
|
if (rename_dst[dst_index].is_rename)
|
|
|
|
continue; /* already used previously */
|
|
|
|
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
/* Estimate the similarity */
|
|
|
|
one = rename_src[src_index].p->one;
|
|
|
|
two = rename_dst[dst_index].p->two;
|
|
|
|
score = estimate_similarity(options->repo, one, two,
|
|
|
|
minimum_score, skip_unmodified);
|
|
|
|
|
|
|
|
/* If sufficiently similar, record as rename pair */
|
|
|
|
if (score < minimum_score)
|
|
|
|
continue;
|
|
|
|
record_rename_pair(dst_index, src_index, score);
|
|
|
|
renames++;
|
2021-02-27 01:30:46 +01:00
|
|
|
update_dir_rename_counts(info, dirs_removed,
|
|
|
|
one->path, two->path);
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Found a rename so don't need text anymore; if we
|
|
|
|
* didn't find a rename, the filespec_blob would get
|
|
|
|
* re-used when doing the matrix of comparisons.
|
|
|
|
*/
|
|
|
|
diff_free_filespec_blob(one);
|
|
|
|
diff_free_filespec_blob(two);
|
|
|
|
}
|
|
|
|
}
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
|
|
|
|
strintmap_clear(&sources);
|
|
|
|
strintmap_clear(&dests);
|
|
|
|
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
return renames;
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
}
|
|
|
|
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
#define NUM_CANDIDATE_PER_DST 4
|
|
|
|
static void record_if_better(struct diff_score m[], struct diff_score *o)
|
|
|
|
{
|
|
|
|
int i, worst;
|
|
|
|
|
|
|
|
/* find the worst one */
|
|
|
|
worst = 0;
|
|
|
|
for (i = 1; i < NUM_CANDIDATE_PER_DST; i++)
|
|
|
|
if (score_compare(&m[i], &m[worst]) > 0)
|
|
|
|
worst = i;
|
|
|
|
|
|
|
|
/* is it better than the worst one? */
|
|
|
|
if (score_compare(&m[worst], o) > 0)
|
|
|
|
m[worst] = *o;
|
|
|
|
}
|
|
|
|
|
2011-01-06 22:50:06 +01:00
|
|
|
/*
|
|
|
|
* Returns:
|
|
|
|
* 0 if we are under the limit;
|
|
|
|
* 1 if we need to disable inexact rename detection;
|
|
|
|
* 2 if we would be under the limit if we were given -C instead of -C -C.
|
|
|
|
*/
|
2020-12-11 10:08:41 +01:00
|
|
|
static int too_many_rename_candidates(int num_destinations, int num_sources,
|
2011-01-06 22:50:04 +01:00
|
|
|
struct diff_options *options)
|
|
|
|
{
|
|
|
|
int rename_limit = options->rename_limit;
|
2020-12-11 10:08:41 +01:00
|
|
|
int i, limited_sources;
|
2011-01-06 22:50:04 +01:00
|
|
|
|
|
|
|
options->needed_rename_limit = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This basically does a test for the rename matrix not
|
|
|
|
* growing larger than a "rename_limit" square matrix, ie:
|
|
|
|
*
|
2020-12-11 10:08:41 +01:00
|
|
|
* num_destinations * num_sources > rename_limit * rename_limit
|
2020-12-11 10:08:42 +01:00
|
|
|
*
|
|
|
|
* We use st_mult() to check overflow conditions; in the
|
|
|
|
* exceptional circumstance that size_t isn't large enough to hold
|
|
|
|
* the multiplication, the system won't be able to allocate enough
|
|
|
|
* memory for the matrix anyway.
|
2011-01-06 22:50:04 +01:00
|
|
|
*/
|
2017-11-29 21:11:54 +01:00
|
|
|
if (rename_limit <= 0)
|
|
|
|
rename_limit = 32767;
|
2020-12-11 10:08:42 +01:00
|
|
|
if (st_mult(num_destinations, num_sources)
|
|
|
|
<= st_mult(rename_limit, rename_limit))
|
2011-01-06 22:50:04 +01:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
options->needed_rename_limit =
|
2020-12-11 10:08:41 +01:00
|
|
|
num_sources > num_destinations ? num_sources : num_destinations;
|
2011-01-06 22:50:06 +01:00
|
|
|
|
|
|
|
/* Are we running under -C -C? */
|
2017-10-31 19:19:11 +01:00
|
|
|
if (!options->flags.find_copies_harder)
|
2011-01-06 22:50:06 +01:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* Would we bust the limit if we were running under -C? */
|
2020-12-11 10:08:41 +01:00
|
|
|
for (limited_sources = i = 0; i < num_sources; i++) {
|
2011-01-06 22:50:06 +01:00
|
|
|
if (diff_unmodified_pair(rename_src[i].p))
|
|
|
|
continue;
|
2020-12-11 10:08:41 +01:00
|
|
|
limited_sources++;
|
2011-01-06 22:50:06 +01:00
|
|
|
}
|
2020-12-11 10:08:42 +01:00
|
|
|
if (st_mult(num_destinations, limited_sources)
|
|
|
|
<= st_mult(rename_limit, rename_limit))
|
2011-01-06 22:50:06 +01:00
|
|
|
return 2;
|
2011-01-06 22:50:04 +01:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:46 +01:00
|
|
|
static int find_renames(struct diff_score *mx,
|
|
|
|
int dst_cnt,
|
|
|
|
int minimum_score,
|
|
|
|
int copies,
|
|
|
|
struct dir_rename_info *info,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *dirs_removed)
|
2011-02-19 05:10:32 +01:00
|
|
|
{
|
|
|
|
int count = 0, i;
|
|
|
|
|
|
|
|
for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {
|
|
|
|
struct diff_rename_dst *dst;
|
|
|
|
|
|
|
|
if ((mx[i].dst < 0) ||
|
|
|
|
(mx[i].score < minimum_score))
|
|
|
|
break; /* there is no more usable pair. */
|
|
|
|
dst = &rename_dst[mx[i].dst];
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (dst->is_rename)
|
2011-02-19 05:10:32 +01:00
|
|
|
continue; /* already done, either exact or fuzzy. */
|
2011-01-06 22:50:05 +01:00
|
|
|
if (!copies && rename_src[mx[i].src].p->one->rename_used)
|
2011-02-19 05:10:32 +01:00
|
|
|
continue;
|
|
|
|
record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
|
|
|
|
count++;
|
2021-02-27 01:30:46 +01:00
|
|
|
update_dir_rename_counts(info, dirs_removed,
|
|
|
|
rename_src[mx[i].src].p->one->path,
|
|
|
|
rename_dst[mx[i].dst].p->two->path);
|
2011-02-19 05:10:32 +01:00
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
static void remove_unneeded_paths_from_src(int detecting_copies,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *interesting)
|
2021-02-14 08:35:01 +01:00
|
|
|
{
|
|
|
|
int i, new_num_src;
|
|
|
|
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
if (detecting_copies && !interesting)
|
2021-02-14 08:35:01 +01:00
|
|
|
return; /* nothing to remove */
|
|
|
|
if (break_idx)
|
|
|
|
return; /* culling incompatible with break detection */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note on reasons why we cull unneeded sources but not destinations:
|
|
|
|
* 1) Pairings are stored in rename_dst (not rename_src), which we
|
|
|
|
* need to keep around. So, we just can't cull rename_dst even
|
|
|
|
* if we wanted to. But doing so wouldn't help because...
|
|
|
|
*
|
|
|
|
* 2) There is a matrix pairwise comparison that follows the
|
|
|
|
* "Performing inexact rename detection" progress message.
|
|
|
|
* Iterating over the destinations is done in the outer loop,
|
|
|
|
* hence we only iterate over each of those once and we can
|
|
|
|
* easily skip the outer loop early if the destination isn't
|
|
|
|
* relevant. That's only one check per destination path to
|
|
|
|
* skip.
|
|
|
|
*
|
|
|
|
* By contrast, the sources are iterated in the inner loop; if
|
|
|
|
* we check whether a source can be skipped, then we'll be
|
|
|
|
* checking it N separate times, once for each destination.
|
|
|
|
* We don't want to have to iterate over known-not-needed
|
|
|
|
* sources N times each, so avoid that by removing the sources
|
|
|
|
* from rename_src here.
|
|
|
|
*/
|
|
|
|
for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
struct diff_filespec *one = rename_src[i].p->one;
|
|
|
|
|
2021-02-14 08:35:01 +01:00
|
|
|
/*
|
|
|
|
* renames are stored in rename_dst, so if a rename has
|
|
|
|
* already been detected using this source, we can just
|
|
|
|
* remove the source knowing rename_dst has its info.
|
|
|
|
*/
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
if (!detecting_copies && one->rename_used)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we don't care about the source path, skip it */
|
2021-03-13 23:22:02 +01:00
|
|
|
if (interesting && !strintmap_contains(interesting, one->path))
|
2021-02-14 08:35:01 +01:00
|
|
|
continue;
|
|
|
|
|
|
|
|
if (new_num_src < i)
|
|
|
|
memcpy(&rename_src[new_num_src], &rename_src[i],
|
|
|
|
sizeof(struct diff_rename_src));
|
|
|
|
new_num_src++;
|
|
|
|
}
|
|
|
|
|
|
|
|
rename_src_nr = new_num_src;
|
|
|
|
}
|
|
|
|
|
diffcore-rename: take advantage of "majority rules" to skip more renames
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.
Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.
However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already. Add a function meant to compute
whether particular renames are still needed based on this majority rules
check. The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.
An important side note for future further optimization:
There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories. In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection. Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely. However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted. If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:01 +01:00
|
|
|
static void handle_early_known_dir_renames(struct dir_rename_info *info,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *relevant_sources,
|
|
|
|
struct strintmap *dirs_removed)
|
diffcore-rename: take advantage of "majority rules" to skip more renames
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.
Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.
However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already. Add a function meant to compute
whether particular renames are still needed based on this majority rules
check. The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.
An important side note for future further optimization:
There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories. In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection. Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely. However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted. If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:01 +01:00
|
|
|
{
|
|
|
|
/*
|
2021-03-13 23:22:05 +01:00
|
|
|
* Directory renames are determined via an aggregate of all renames
|
|
|
|
* under them and using a "majority wins" rule. The fact that
|
|
|
|
* "majority wins", though, means we don't need all the renames
|
|
|
|
* under the given directory, we only need enough to ensure we have
|
|
|
|
* a majority.
|
diffcore-rename: take advantage of "majority rules" to skip more renames
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.
Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.
However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already. Add a function meant to compute
whether particular renames are still needed based on this majority rules
check. The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.
An important side note for future further optimization:
There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories. In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection. Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely. However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted. If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:01 +01:00
|
|
|
*/
|
2021-03-13 23:22:05 +01:00
|
|
|
|
diffcore-rename: determine which relevant_sources are no longer relevant
As noted a few commits ago ("diffcore-rename: only compute
dir_rename_count for relevant directories"), when a source file rename
is used as part of directory rename detection, we need to increment
counts for each ancestor directory in dirs_removed with value
RELEVANT_FOR_SELF. However, a few commits ago ("diffcore-rename: check
if we have enough renames for directories early on"), we may have
downgraded all relevant ancestor directories from RELEVANT_FOR_SELF to
RELEVANT_FOR_ANCESTOR.
For a given file, if no ancestor directory is found in dirs_removed with
a value of RELEVANT_FOR_SELF, then we can downgrade
relevant_source[PATH] from RELEVANT_LOCATION to RELEVANT_NO_MORE. This
means we can skip detecting a rename for that particular path (and any
other paths in the same directory).
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 5.680 s ± 0.096 s 5.665 s ± 0.129 s
mega-renames: 13.812 s ± 0.162 s 11.435 s ± 0.158 s
just-one-mega: 506.0 ms ± 3.9 ms 494.2 ms ± 6.1 ms
While this improvement looks rather modest for these testcases (because
all the previous optimizations were sufficient to nearly remove all time
spent in rename detection already), consider this alternative testcase
tweaked from the ones in commit 557ac0350d as follows
<Same initial setup as commit 557ac0350d, then...>
$ git switch -c add-empty-file v5.5
$ >drivers/gpu/drm/i915/new-empty-file
$ git add drivers/gpu/drm/i915/new-empty-file
$ git commit -m "new file"
$ git switch 5.4-rename
$ git cherry-pick --strategy=ort add-empty-file
For this testcase, we see the following improvement:
Before After
pick-empty: 1.936 s ± 0.024 s 688.1 ms ± 4.2 ms
So roughly a factor of 3 speedup. At $DAYJOB, there was a particular
repository and cherry-pick that inspired this optimization; for that
case I saw a speedup factor of 7 with this optimization.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:08 +01:00
|
|
|
int i, new_num_src;
|
2021-03-13 23:22:05 +01:00
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
|
|
|
|
if (!dirs_removed || !relevant_sources)
|
|
|
|
return; /* nothing to cull */
|
|
|
|
if (break_idx)
|
|
|
|
return; /* culling incompatbile with break detection */
|
|
|
|
|
|
|
|
/*
|
2021-03-13 23:22:06 +01:00
|
|
|
* Supplement dir_rename_count with number of potential renames,
|
|
|
|
* marking all potential rename sources as mapping to UNKNOWN_DIR.
|
2021-03-13 23:22:05 +01:00
|
|
|
*/
|
2021-03-13 23:22:06 +01:00
|
|
|
for (i = 0; i < rename_src_nr; i++) {
|
|
|
|
char *old_dir;
|
|
|
|
struct diff_filespec *one = rename_src[i].p->one;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* sources that are part of a rename will have already been
|
|
|
|
* removed by a prior call to remove_unneeded_paths_from_src()
|
|
|
|
*/
|
|
|
|
assert(!one->rename_used);
|
|
|
|
|
|
|
|
old_dir = get_dirname(one->path);
|
|
|
|
while (*old_dir != '\0' &&
|
|
|
|
NOT_RELEVANT != strintmap_get(dirs_removed, old_dir)) {
|
|
|
|
char *freeme = old_dir;
|
|
|
|
|
|
|
|
increment_count(info, old_dir, UNKNOWN_DIR);
|
|
|
|
old_dir = get_dirname(old_dir);
|
|
|
|
|
|
|
|
/* Free resources we don't need anymore */
|
|
|
|
free(freeme);
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* old_dir and new_dir free'd in increment_count, but
|
|
|
|
* get_dirname() gives us a new pointer we need to free for
|
|
|
|
* old_dir. Also, if the loop runs 0 times we need old_dir
|
|
|
|
* to be freed.
|
|
|
|
*/
|
|
|
|
free(old_dir);
|
|
|
|
}
|
2021-03-13 23:22:05 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For any directory which we need a potential rename detected for
|
|
|
|
* (i.e. those marked as RELEVANT_FOR_SELF in dirs_removed), check
|
|
|
|
* whether we have enough renames to satisfy the "majority rules"
|
|
|
|
* requirement such that detecting any more renames of files under
|
|
|
|
* it won't change the result. For any such directory, mark that
|
|
|
|
* we no longer need to detect a rename for it. However, since we
|
|
|
|
* might need to still detect renames for an ancestor of that
|
|
|
|
* directory, use RELEVANT_FOR_ANCESTOR.
|
|
|
|
*/
|
|
|
|
strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
|
|
|
|
/* entry->key is source_dir */
|
|
|
|
struct strintmap *counts = entry->value;
|
|
|
|
|
|
|
|
if (strintmap_get(dirs_removed, entry->key) ==
|
|
|
|
RELEVANT_FOR_SELF &&
|
|
|
|
dir_rename_already_determinable(counts)) {
|
|
|
|
strintmap_set(dirs_removed, entry->key,
|
|
|
|
RELEVANT_FOR_ANCESTOR);
|
|
|
|
}
|
|
|
|
}
|
diffcore-rename: determine which relevant_sources are no longer relevant
As noted a few commits ago ("diffcore-rename: only compute
dir_rename_count for relevant directories"), when a source file rename
is used as part of directory rename detection, we need to increment
counts for each ancestor directory in dirs_removed with value
RELEVANT_FOR_SELF. However, a few commits ago ("diffcore-rename: check
if we have enough renames for directories early on"), we may have
downgraded all relevant ancestor directories from RELEVANT_FOR_SELF to
RELEVANT_FOR_ANCESTOR.
For a given file, if no ancestor directory is found in dirs_removed with
a value of RELEVANT_FOR_SELF, then we can downgrade
relevant_source[PATH] from RELEVANT_LOCATION to RELEVANT_NO_MORE. This
means we can skip detecting a rename for that particular path (and any
other paths in the same directory).
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 5.680 s ± 0.096 s 5.665 s ± 0.129 s
mega-renames: 13.812 s ± 0.162 s 11.435 s ± 0.158 s
just-one-mega: 506.0 ms ± 3.9 ms 494.2 ms ± 6.1 ms
While this improvement looks rather modest for these testcases (because
all the previous optimizations were sufficient to nearly remove all time
spent in rename detection already), consider this alternative testcase
tweaked from the ones in commit 557ac0350d as follows
<Same initial setup as commit 557ac0350d, then...>
$ git switch -c add-empty-file v5.5
$ >drivers/gpu/drm/i915/new-empty-file
$ git add drivers/gpu/drm/i915/new-empty-file
$ git commit -m "new file"
$ git switch 5.4-rename
$ git cherry-pick --strategy=ort add-empty-file
For this testcase, we see the following improvement:
Before After
pick-empty: 1.936 s ± 0.024 s 688.1 ms ± 4.2 ms
So roughly a factor of 3 speedup. At $DAYJOB, there was a particular
repository and cherry-pick that inspired this optimization; for that
case I saw a speedup factor of 7 with this optimization.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:08 +01:00
|
|
|
|
|
|
|
for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
|
|
|
|
struct diff_filespec *one = rename_src[i].p->one;
|
|
|
|
int val;
|
|
|
|
|
|
|
|
val = strintmap_get(relevant_sources, one->path);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* sources that were not found in relevant_sources should
|
|
|
|
* have already been removed by a prior call to
|
|
|
|
* remove_unneeded_paths_from_src()
|
|
|
|
*/
|
|
|
|
assert(val != -1);
|
|
|
|
|
|
|
|
if (val == RELEVANT_LOCATION) {
|
|
|
|
int removable = 1;
|
|
|
|
char *dir = get_dirname(one->path);
|
|
|
|
while (1) {
|
|
|
|
char *freeme = dir;
|
|
|
|
int res = strintmap_get(dirs_removed, dir);
|
|
|
|
|
|
|
|
/* Quit if not found or irrelevant */
|
|
|
|
if (res == NOT_RELEVANT)
|
|
|
|
break;
|
|
|
|
/* If RELEVANT_FOR_SELF, can't remove */
|
|
|
|
if (res == RELEVANT_FOR_SELF) {
|
|
|
|
removable = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* Else continue searching upwards */
|
|
|
|
assert(res == RELEVANT_FOR_ANCESTOR);
|
|
|
|
dir = get_dirname(dir);
|
|
|
|
free(freeme);
|
|
|
|
}
|
|
|
|
free(dir);
|
|
|
|
if (removable) {
|
|
|
|
strintmap_set(relevant_sources, one->path,
|
|
|
|
RELEVANT_NO_MORE);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (new_num_src < i)
|
|
|
|
memcpy(&rename_src[new_num_src], &rename_src[i],
|
|
|
|
sizeof(struct diff_rename_src));
|
|
|
|
new_num_src++;
|
|
|
|
}
|
|
|
|
|
|
|
|
rename_src_nr = new_num_src;
|
diffcore-rename: take advantage of "majority rules" to skip more renames
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.
Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.
However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already. Add a function meant to compute
whether particular renames are still needed based on this majority rules
check. The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.
An important side note for future further optimization:
There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories. In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection. Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely. However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted. If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:01 +01:00
|
|
|
}
|
|
|
|
|
2021-02-27 01:30:42 +01:00
|
|
|
void diffcore_rename_extended(struct diff_options *options,
|
2021-03-13 23:22:02 +01:00
|
|
|
struct strintmap *relevant_sources,
|
|
|
|
struct strintmap *dirs_removed,
|
2021-02-27 01:30:42 +01:00
|
|
|
struct strmap *dir_rename_count)
|
2005-05-21 11:39:09 +02:00
|
|
|
{
|
2005-09-21 09:18:27 +02:00
|
|
|
int detect_rename = options->detect_rename;
|
|
|
|
int minimum_score = options->rename_score;
|
2005-05-22 04:40:36 +02:00
|
|
|
struct diff_queue_struct *q = &diff_queued_diff;
|
2005-09-16 01:13:43 +02:00
|
|
|
struct diff_queue_struct outq;
|
2005-05-21 11:39:09 +02:00
|
|
|
struct diff_score *mx;
|
2011-01-06 22:50:06 +01:00
|
|
|
int i, j, rename_count, skip_unmodified = 0;
|
2020-12-11 10:08:40 +01:00
|
|
|
int num_destinations, dst_cnt;
|
2021-02-03 21:03:46 +01:00
|
|
|
int num_sources, want_copies;
|
2011-02-20 10:51:16 +01:00
|
|
|
struct progress *progress = NULL;
|
2021-02-27 01:30:40 +01:00
|
|
|
struct dir_rename_info info;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "setup", options->repo);
|
2021-02-27 01:30:40 +01:00
|
|
|
info.setup = 0;
|
2021-02-27 01:30:42 +01:00
|
|
|
assert(!dir_rename_count || strmap_empty(dir_rename_count));
|
2021-02-03 21:03:46 +01:00
|
|
|
want_copies = (detect_rename == DIFF_DETECT_COPY);
|
2021-02-27 01:30:46 +01:00
|
|
|
if (dirs_removed && (break_idx || want_copies))
|
|
|
|
BUG("dirs_removed incompatible with break/copy detection");
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
if (break_idx && relevant_sources)
|
|
|
|
BUG("break detection incompatible with source specification");
|
2005-05-22 08:33:32 +02:00
|
|
|
if (!minimum_score)
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
minimum_score = DEFAULT_RENAME_SCORE;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
|
|
|
for (i = 0; i < q->nr; i++) {
|
2005-05-21 11:40:01 +02:00
|
|
|
struct diff_filepair *p = q->queue[i];
|
2006-11-02 09:02:11 +01:00
|
|
|
if (!DIFF_FILE_VALID(p->one)) {
|
2005-05-22 04:42:18 +02:00
|
|
|
if (!DIFF_FILE_VALID(p->two))
|
2005-05-23 06:24:49 +02:00
|
|
|
continue; /* unmerged */
|
2006-11-02 09:02:11 +01:00
|
|
|
else if (options->single_follow &&
|
|
|
|
strcmp(options->single_follow, p->two->path))
|
|
|
|
continue; /* not interested */
|
2017-10-31 19:19:11 +01:00
|
|
|
else if (!options->flags.rename_empty &&
|
2017-05-30 19:31:08 +02:00
|
|
|
is_empty_blob_oid(&p->two->oid))
|
teach diffcore-rename to optionally ignore empty content
Our rename detection is a heuristic, matching pairs of
removed and added files with similar or identical content.
It's unlikely to be wrong when there is actual content to
compare, and we already take care not to do inexact rename
detection when there is not enough content to produce good
results.
However, we always do exact rename detection, even when the
blob is tiny or empty. It's easy to get false positives with
an empty blob, simply because it is an obvious content to
use as a boilerplate (e.g., when telling git that an empty
directory is worth tracking via an empty .gitignore).
This patch lets callers specify whether or not they are
interested in using empty files as rename sources and
destinations. The default is "yes", keeping the original
behavior. It works by detecting the empty-blob sha1 for
rename sources and destinations.
One more flexible alternative would be to allow the caller
to specify a minimum size for a blob to be "interesting" for
rename detection. But that would catch small boilerplate
files, not large ones (e.g., if you had the GPL COPYING file
in many directories).
A better alternative would be to allow a "-rename"
gitattribute to allow boilerplate files to be marked as
such. I'll leave the complexity of that solution until such
time as somebody actually wants it. The complaints we've
seen so far revolve around empty files, so let's start with
the simple thing.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-03-22 23:52:13 +01:00
|
|
|
continue;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
else if (add_rename_dst(p) < 0) {
|
2015-02-27 02:42:27 +01:00
|
|
|
warning("skipping rename detection, detected"
|
|
|
|
" duplicate destination '%s'",
|
|
|
|
p->two->path);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
2006-11-02 09:02:11 +01:00
|
|
|
}
|
2017-10-31 19:19:11 +01:00
|
|
|
else if (!options->flags.rename_empty &&
|
2017-05-30 19:31:08 +02:00
|
|
|
is_empty_blob_oid(&p->one->oid))
|
teach diffcore-rename to optionally ignore empty content
Our rename detection is a heuristic, matching pairs of
removed and added files with similar or identical content.
It's unlikely to be wrong when there is actual content to
compare, and we already take care not to do inexact rename
detection when there is not enough content to produce good
results.
However, we always do exact rename detection, even when the
blob is tiny or empty. It's easy to get false positives with
an empty blob, simply because it is an obvious content to
use as a boilerplate (e.g., when telling git that an empty
directory is worth tracking via an empty .gitignore).
This patch lets callers specify whether or not they are
interested in using empty files as rename sources and
destinations. The default is "yes", keeping the original
behavior. It works by detecting the empty-blob sha1 for
rename sources and destinations.
One more flexible alternative would be to allow the caller
to specify a minimum size for a blob to be "interesting" for
rename detection. But that would catch small boilerplate
files, not large ones (e.g., if you had the GPL COPYING file
in many directories).
A better alternative would be to allow a "-rename"
gitattribute to allow boilerplate files to be marked as
such. I'll leave the complexity of that solution until such
time as somebody actually wants it. The complaints we've
seen so far revolve around empty files, so let's start with
the simple thing.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-03-22 23:52:13 +01:00
|
|
|
continue;
|
diffcore-rename: don't consider unmerged path as source
Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for
unmerged entries., 2007-01-05), an unmerged entry should be detected by
using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of
the filepair records mode=0 entries. However, it forgot to update some
parts of the rename detection logic.
This only makes difference in the "diff --cached" codepath where an
unmerged filepair carries information on the entries that came from the
tree. It probably hasn't been noticed for a long time because nobody
would run "diff -M" during a conflict resolution, but "git status" uses
rename detection when it internally runs "diff-index" and "diff-files"
and gives nonsense results.
In an unmerged pair, "one" side can have a valid filespec to record the
tree entry (e.g. what's in HEAD) when running "diff --cached". This can
be used as a rename source to other paths in the index that are not
unmerged. The path that is unmerged by definition does not have the
final content yet (i.e. "two" side cannot have a valid filespec), so it
can never be a rename destination.
Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and
allow the valid "one" side of an unmerged filepair to be considered a
potential rename source, but never to be considered a rename destination.
Commit message and first two test cases by Junio, the rest by Martin.
Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-03-24 03:41:01 +01:00
|
|
|
else if (!DIFF_PAIR_UNMERGED(p) && !DIFF_FILE_VALID(p->two)) {
|
2007-10-25 20:20:56 +02:00
|
|
|
/*
|
|
|
|
* If the source is a broken "delete", and
|
2005-06-12 05:55:20 +02:00
|
|
|
* they did not really want to get broken,
|
|
|
|
* that means the source actually stays.
|
2007-10-25 20:20:56 +02:00
|
|
|
* So we increment the "rename_used" score
|
|
|
|
* by one, to indicate ourselves as a user
|
|
|
|
*/
|
|
|
|
if (p->broken_pair && !p->score)
|
|
|
|
p->one->rename_used++;
|
2011-01-06 22:50:05 +01:00
|
|
|
register_rename_src(p);
|
2007-10-25 20:20:56 +02:00
|
|
|
}
|
2021-02-03 21:03:46 +01:00
|
|
|
else if (want_copies) {
|
2007-10-25 20:20:56 +02:00
|
|
|
/*
|
|
|
|
* Increment the "rename_used" score by
|
|
|
|
* one, to indicate ourselves as a user.
|
2005-06-12 05:55:20 +02:00
|
|
|
*/
|
2007-10-25 20:20:56 +02:00
|
|
|
p->one->rename_used++;
|
2011-01-06 22:50:05 +01:00
|
|
|
register_rename_src(p);
|
2005-06-12 05:55:20 +02:00
|
|
|
}
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "setup", options->repo);
|
Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 19:39:48 +02:00
|
|
|
if (rename_dst_nr == 0 || rename_src_nr == 0)
|
2005-05-21 11:39:09 +02:00
|
|
|
goto cleanup; /* nothing to do */
|
|
|
|
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "exact renames", options->repo);
|
Do exact rename detection regardless of rename limits
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.
In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.
With the new code, that resulted in
[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
400006
real 0m2.071s
user 0m1.520s
sys 0m0.576s
ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:
diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
similarity index 100%
rename from really-big-dir/file-1-1-1-1-1
rename to moved-big-dir/file-1-1-1-1-1
and the extra six lines is from a one-liner commit message and all the
commit information and spacing).
Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.
With the old code, this wouldn't have been practically possible. Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive. In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:
[torvalds@woody big-rename]$ time git show -C | wc -l
1400006
real 0m12.337s
user 0m12.285s
sys 0m0.192s
ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-25 20:24:47 +02:00
|
|
|
/*
|
|
|
|
* We really want to cull the candidates list early
|
|
|
|
* with cheap tests in order to avoid doing deltas.
|
|
|
|
*/
|
2011-02-19 04:55:19 +01:00
|
|
|
rename_count = find_exact_renames(options);
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "exact renames", options->repo);
|
Do exact rename detection regardless of rename limits
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.
In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.
With the new code, that resulted in
[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
400006
real 0m2.071s
user 0m1.520s
sys 0m0.576s
ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:
diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
similarity index 100%
rename from really-big-dir/file-1-1-1-1-1
rename to moved-big-dir/file-1-1-1-1-1
and the extra six lines is from a one-liner commit message and all the
commit information and spacing).
Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.
With the old code, this wouldn't have been practically possible. Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive. In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:
[torvalds@woody big-rename]$ time git show -C | wc -l
1400006
real 0m12.337s
user 0m12.285s
sys 0m0.192s
ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-25 20:24:47 +02:00
|
|
|
|
2007-10-27 01:56:34 +02:00
|
|
|
/* Did we only want exact renames? */
|
|
|
|
if (minimum_score == MAX_SCORE)
|
|
|
|
goto cleanup;
|
|
|
|
|
2021-02-03 21:03:46 +01:00
|
|
|
num_sources = rename_src_nr;
|
2007-10-27 01:56:34 +02:00
|
|
|
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
if (want_copies || break_idx) {
|
|
|
|
/*
|
|
|
|
* Cull sources:
|
|
|
|
* - remove ones corresponding to exact renames
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
* - remove ones not found in relevant_sources
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
*/
|
|
|
|
trace2_region_enter("diff", "cull after exact", options->repo);
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
remove_unneeded_paths_from_src(want_copies, relevant_sources);
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
trace2_region_leave("diff", "cull after exact", options->repo);
|
|
|
|
} else {
|
|
|
|
/* Determine minimum score to match basenames */
|
|
|
|
double factor = 0.5;
|
|
|
|
char *basename_factor = getenv("GIT_BASENAME_FACTOR");
|
|
|
|
int min_basename_score;
|
|
|
|
|
|
|
|
if (basename_factor)
|
|
|
|
factor = strtol(basename_factor, NULL, 10)/100.0;
|
|
|
|
assert(factor >= 0.0 && factor <= 1.0);
|
|
|
|
min_basename_score = minimum_score +
|
|
|
|
(int)(factor * (MAX_SCORE - minimum_score));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cull sources:
|
|
|
|
* - remove ones involved in renames (found via exact match)
|
|
|
|
*/
|
|
|
|
trace2_region_enter("diff", "cull after exact", options->repo);
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
remove_unneeded_paths_from_src(want_copies, NULL);
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
trace2_region_leave("diff", "cull after exact", options->repo);
|
|
|
|
|
2021-02-27 01:30:41 +01:00
|
|
|
/* Preparation for basename-driven matching. */
|
|
|
|
trace2_region_enter("diff", "dir rename setup", options->repo);
|
2021-03-11 01:38:31 +01:00
|
|
|
initialize_dir_rename_info(&info, relevant_sources,
|
2021-02-27 01:30:46 +01:00
|
|
|
dirs_removed, dir_rename_count);
|
2021-02-27 01:30:41 +01:00
|
|
|
trace2_region_leave("diff", "dir rename setup", options->repo);
|
|
|
|
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
/* Utilize file basenames to quickly find renames. */
|
|
|
|
trace2_region_enter("diff", "basename matches", options->repo);
|
|
|
|
rename_count += find_basename_matches(options,
|
2021-02-27 01:30:40 +01:00
|
|
|
min_basename_score,
|
2021-03-11 01:38:31 +01:00
|
|
|
&info,
|
|
|
|
relevant_sources,
|
|
|
|
dirs_removed);
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
trace2_region_leave("diff", "basename matches", options->repo);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cull sources, again:
|
|
|
|
* - remove ones involved in renames (found via basenames)
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
* - remove ones not found in relevant_sources
|
diffcore-rename: take advantage of "majority rules" to skip more renames
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.
Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.
However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already. Add a function meant to compute
whether particular renames are still needed based on this majority rules
check. The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.
An important side note for future further optimization:
There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories. In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection. Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely. However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted. If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:01 +01:00
|
|
|
* and
|
|
|
|
* - remove ones in relevant_sources which are needed only
|
|
|
|
* for directory renames IF no ancestory directory
|
|
|
|
* actually needs to know any more individual path
|
|
|
|
* renames under them
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
*/
|
|
|
|
trace2_region_enter("diff", "cull basename", options->repo);
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
remove_unneeded_paths_from_src(want_copies, relevant_sources);
|
diffcore-rename: take advantage of "majority rules" to skip more renames
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.
Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.
However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already. Add a function meant to compute
whether particular renames are still needed based on this majority rules
check. The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.
An important side note for future further optimization:
There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories. In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection. Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely. However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted. If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 23:22:01 +01:00
|
|
|
handle_early_known_dir_renames(&info, relevant_sources,
|
|
|
|
dirs_removed);
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
trace2_region_leave("diff", "cull basename", options->repo);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Calculate how many rename destinations are left */
|
|
|
|
num_destinations = (rename_dst_nr - rename_count);
|
|
|
|
num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
|
|
|
|
|
2007-10-27 01:56:34 +02:00
|
|
|
/* All done? */
|
2021-02-03 21:03:46 +01:00
|
|
|
if (!num_destinations || !num_sources)
|
2007-10-27 01:56:34 +02:00
|
|
|
goto cleanup;
|
|
|
|
|
2021-02-03 21:03:46 +01:00
|
|
|
switch (too_many_rename_candidates(num_destinations, num_sources,
|
2020-12-11 10:08:41 +01:00
|
|
|
options)) {
|
2011-01-06 22:50:06 +01:00
|
|
|
case 1:
|
Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 19:39:48 +02:00
|
|
|
goto cleanup;
|
2011-01-06 22:50:06 +01:00
|
|
|
case 2:
|
|
|
|
options->degraded_cc_to_c = 1;
|
|
|
|
skip_unmodified = 1;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 19:39:48 +02:00
|
|
|
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "inexact renames", options->repo);
|
2011-02-20 10:51:16 +01:00
|
|
|
if (options->show_rename_progress) {
|
progress: simplify "delayed" progress API
We used to expose the full power of the delayed progress API to the
callers, so that they can specify, not just the message to show and
expected total amount of work that is used to compute the percentage
of work performed so far, the percent-threshold parameter P and the
delay-seconds parameter N. The progress meter starts to show at N
seconds into the operation only if we have not yet completed P per-cent
of the total work.
Most callers used either (0%, 2s) or (50%, 1s) as (P, N), but there
are oddballs that chose more random-looking values like 95%.
For a smoother workload, (50%, 1s) would allow us to start showing
the progress meter earlier than (0%, 2s), while keeping the chance
of not showing progress meter for long running operation the same as
the latter. For a task that would take 2s or more to complete, it
is likely that less than half of it would complete within the first
second, if the workload is smooth. But for a spiky workload whose
earlier part is easier, such a setting is likely to fail to show the
progress meter entirely and (0%, 2s) is more appropriate.
But that is merely a theory. Realistically, it is of dubious value
to ask each codepath to carefully consider smoothness of their
workload and specify their own setting by passing two extra
parameters. Let's simplify the API by dropping both parameters and
have everybody use (0%, 2s).
Oh, by the way, the percent-threshold parameter and the structure
member were consistently misspelled, which also is now fixed ;-)
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-19 19:39:41 +02:00
|
|
|
progress = start_delayed_progress(
|
2014-02-21 13:50:18 +01:00
|
|
|
_("Performing inexact rename detection"),
|
2021-02-03 21:03:46 +01:00
|
|
|
(uint64_t)num_destinations * (uint64_t)num_sources);
|
2011-02-20 10:51:16 +01:00
|
|
|
}
|
|
|
|
|
2021-03-13 17:17:22 +01:00
|
|
|
CALLOC_ARRAY(mx, st_mult(NUM_CANDIDATE_PER_DST, num_destinations));
|
2005-05-24 10:10:48 +02:00
|
|
|
for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filespec *two = rename_dst[i].p->two;
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
struct diff_score *m;
|
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (rename_dst[i].is_rename)
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
continue; /* exact or basename match already handled */
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
|
|
|
|
m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
|
|
|
|
for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
|
|
|
|
m[j].dst = -1;
|
|
|
|
|
2005-05-24 10:10:48 +02:00
|
|
|
for (j = 0; j < rename_src_nr; j++) {
|
2011-01-06 22:50:05 +01:00
|
|
|
struct diff_filespec *one = rename_src[j].p->one;
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
struct diff_score this_src;
|
2011-01-06 22:50:06 +01:00
|
|
|
|
2021-02-14 08:35:01 +01:00
|
|
|
assert(!one->rename_used || want_copies || break_idx);
|
2021-02-03 21:03:46 +01:00
|
|
|
|
2011-01-06 22:50:06 +01:00
|
|
|
if (skip_unmodified &&
|
|
|
|
diff_unmodified_pair(rename_src[j].p))
|
|
|
|
continue;
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
this_src.score = estimate_similarity(options->repo,
|
|
|
|
one, two,
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
minimum_score,
|
|
|
|
skip_unmodified);
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
this_src.name_score = basename_same(one, two);
|
|
|
|
this_src.dst = i;
|
|
|
|
this_src.src = j;
|
|
|
|
record_if_better(m, &this_src);
|
2009-11-21 07:13:47 +01:00
|
|
|
/*
|
|
|
|
* Once we run estimate_similarity,
|
|
|
|
* We do not need the text anymore.
|
|
|
|
*/
|
2007-10-03 06:01:03 +02:00
|
|
|
diff_free_filespec_blob(one);
|
2009-11-21 07:13:47 +01:00
|
|
|
diff_free_filespec_blob(two);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
dst_cnt++;
|
2020-12-11 10:08:43 +01:00
|
|
|
display_progress(progress,
|
2021-02-03 21:03:46 +01:00
|
|
|
(uint64_t)dst_cnt * (uint64_t)num_sources);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
2011-02-20 10:51:16 +01:00
|
|
|
stop_progress(&progress);
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
|
2005-05-21 11:39:09 +02:00
|
|
|
/* cost matrix sorted by most to least similar pair */
|
2019-09-30 19:21:55 +02:00
|
|
|
STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
|
2021-02-27 01:30:46 +01:00
|
|
|
rename_count += find_renames(mx, dst_cnt, minimum_score, 0,
|
|
|
|
&info, dirs_removed);
|
2021-02-03 21:03:46 +01:00
|
|
|
if (want_copies)
|
2021-02-27 01:30:46 +01:00
|
|
|
rename_count += find_renames(mx, dst_cnt, minimum_score, 1,
|
|
|
|
&info, dirs_removed);
|
2005-05-21 11:39:09 +02:00
|
|
|
free(mx);
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "inexact renames", options->repo);
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-28 00:55:55 +02:00
|
|
|
cleanup:
|
2005-05-21 11:39:09 +02:00
|
|
|
/* At this point, we have found some renames and copies and they
|
2005-09-16 01:13:43 +02:00
|
|
|
* are recorded in rename_dst. The original list is still in *q.
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "write back to queue", options->repo);
|
2010-05-07 06:52:27 +02:00
|
|
|
DIFF_QUEUE_CLEAR(&outq);
|
2005-05-21 11:39:09 +02:00
|
|
|
for (i = 0; i < q->nr; i++) {
|
2005-05-24 10:10:48 +02:00
|
|
|
struct diff_filepair *p = q->queue[i];
|
|
|
|
struct diff_filepair *pair_to_free = NULL;
|
|
|
|
|
diffcore-rename: don't consider unmerged path as source
Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for
unmerged entries., 2007-01-05), an unmerged entry should be detected by
using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of
the filepair records mode=0 entries. However, it forgot to update some
parts of the rename detection logic.
This only makes difference in the "diff --cached" codepath where an
unmerged filepair carries information on the entries that came from the
tree. It probably hasn't been noticed for a long time because nobody
would run "diff -M" during a conflict resolution, but "git status" uses
rename detection when it internally runs "diff-index" and "diff-files"
and gives nonsense results.
In an unmerged pair, "one" side can have a valid filespec to record the
tree entry (e.g. what's in HEAD) when running "diff --cached". This can
be used as a rename source to other paths in the index that are not
unmerged. The path that is unmerged by definition does not have the
final content yet (i.e. "two" side cannot have a valid filespec), so it
can never be a rename destination.
Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and
allow the valid "one" side of an unmerged filepair to be considered a
potential rename source, but never to be considered a rename destination.
Commit message and first two test cases by Junio, the rest by Martin.
Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-03-24 03:41:01 +01:00
|
|
|
if (DIFF_PAIR_UNMERGED(p)) {
|
|
|
|
diff_q(&outq, p);
|
|
|
|
}
|
|
|
|
else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* Creation */
|
|
|
|
diff_q(&outq, p);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
2005-05-30 09:08:07 +02:00
|
|
|
else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {
|
|
|
|
/*
|
|
|
|
* Deletion
|
|
|
|
*
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
* We would output this delete record if:
|
|
|
|
*
|
|
|
|
* (1) this is a broken delete and the counterpart
|
|
|
|
* broken create remains in the output; or
|
2005-09-16 01:13:43 +02:00
|
|
|
* (2) this is not a broken delete, and rename_dst
|
|
|
|
* does not have a rename/copy to move p->one->path
|
|
|
|
* out of existence.
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
*
|
|
|
|
* Otherwise, the counterpart broken create
|
|
|
|
* has been turned into a rename-edit; or
|
|
|
|
* delete did not have a matching create to
|
|
|
|
* begin with.
|
2005-05-30 09:08:07 +02:00
|
|
|
*/
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
if (DIFF_PAIR_BROKEN(p)) {
|
|
|
|
/* broken delete */
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_rename_dst *dst = locate_rename_dst(p);
|
|
|
|
if (!dst)
|
|
|
|
BUG("tracking failed somehow; failed to find associated dst for broken pair");
|
|
|
|
if (dst->is_rename)
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
/* counterpart is now rename/copy */
|
|
|
|
pair_to_free = p;
|
|
|
|
}
|
|
|
|
else {
|
2007-10-25 20:20:56 +02:00
|
|
|
if (p->one->rename_used)
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
/* this path remains */
|
|
|
|
pair_to_free = p;
|
|
|
|
}
|
2005-05-30 09:08:07 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (!pair_to_free)
|
2005-05-30 09:08:07 +02:00
|
|
|
diff_q(&outq, p);
|
|
|
|
}
|
2005-05-24 10:10:48 +02:00
|
|
|
else if (!diff_unmodified_pair(p))
|
2005-05-28 00:55:55 +02:00
|
|
|
/* all the usual ones need to be kept */
|
2005-05-24 10:10:48 +02:00
|
|
|
diff_q(&outq, p);
|
2005-05-28 00:55:55 +02:00
|
|
|
else
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* no need to keep unmodified pairs; FIXME: remove earlier? */
|
2005-05-28 00:55:55 +02:00
|
|
|
pair_to_free = p;
|
|
|
|
|
2005-05-28 00:50:30 +02:00
|
|
|
if (pair_to_free)
|
|
|
|
diff_free_filepair(pair_to_free);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
2005-05-24 10:10:48 +02:00
|
|
|
diff_debug_queue("done copying original", &outq);
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-24 10:10:48 +02:00
|
|
|
free(q->queue);
|
|
|
|
*q = outq;
|
|
|
|
diff_debug_queue("done collapsing", q);
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2007-10-25 20:19:10 +02:00
|
|
|
for (i = 0; i < rename_dst_nr; i++)
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (rename_dst[i].filespec_to_free)
|
|
|
|
free_filespec(rename_dst[i].filespec_to_free);
|
2005-09-16 01:13:43 +02:00
|
|
|
|
2021-02-27 01:30:45 +01:00
|
|
|
cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
|
2017-06-16 01:15:46 +02:00
|
|
|
FREE_AND_NULL(rename_dst);
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_dst_nr = rename_dst_alloc = 0;
|
2017-06-16 01:15:46 +02:00
|
|
|
FREE_AND_NULL(rename_src);
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_src_nr = rename_src_alloc = 0;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (break_idx) {
|
|
|
|
strintmap_clear(break_idx);
|
|
|
|
FREE_AND_NULL(break_idx);
|
|
|
|
}
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "write back to queue", options->repo);
|
2005-05-21 11:39:09 +02:00
|
|
|
return;
|
|
|
|
}
|
2021-02-27 01:30:42 +01:00
|
|
|
|
|
|
|
void diffcore_rename(struct diff_options *options)
|
|
|
|
{
|
diffcore-rename: enable filtering possible rename sources
Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.
There are two important pieces to this optimization that may not be
obvious at first glance:
* We do not require callers to just filter the filepairs out
to remove the non-relevant sources, because exact rename detection
is fast and when it finds a match it can remove both a source and a
destination whereas the relevant_sources filter can only remove a
source.
* We need to filter out the source pairs in a preliminary pass instead
of adding a
strset_contains(relevant_sources, one->path)
check within the nested matrix loop. The reason for that is if we
have 30k renames, doing 30k * 30k = 900M strset_contains() calls
becomes extraordinarily expensive and defeats the performance gains
from this change; we only want to do 30k such calls instead.
If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant. Since all callers currently pass NULL,
this optimization does not yet have any effect. Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().
A note about filtering order:
Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames. While that technically could be
done at this point, there are two reasons to defer it:
First, was to reinforce a lesson that was too easy to forget. As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up. Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering. That slowed things down a
lot. But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again. So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.
Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant. It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones. Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either. It has to be deferred. So the deferral is in part to simplify
some later additions.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-11 01:38:24 +01:00
|
|
|
diffcore_rename_extended(options, NULL, NULL, NULL);
|
2021-02-27 01:30:42 +01:00
|
|
|
}
|