2005-05-21 11:39:09 +02:00
|
|
|
/*
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
*
|
2005-05-21 11:39:09 +02:00
|
|
|
* Copyright (C) 2005 Junio C Hamano
|
|
|
|
*/
|
|
|
|
#include "cache.h"
|
|
|
|
#include "diff.h"
|
|
|
|
#include "diffcore.h"
|
2018-05-16 01:42:15 +02:00
|
|
|
#include "object-store.h"
|
2013-11-14 20:20:26 +01:00
|
|
|
#include "hashmap.h"
|
2011-02-20 10:51:16 +01:00
|
|
|
#include "progress.h"
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
#include "promisor-remote.h"
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
#include "strmap.h"
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-24 10:10:48 +02:00
|
|
|
/* Table of rename/copy destinations */
|
|
|
|
|
|
|
|
static struct diff_rename_dst {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filepair *p;
|
|
|
|
struct diff_filespec *filespec_to_free;
|
|
|
|
int is_rename; /* false -> just a create; true -> rename or copy */
|
2005-05-24 10:10:48 +02:00
|
|
|
} *rename_dst;
|
|
|
|
static int rename_dst_nr, rename_dst_alloc;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* Mapping from break source pathname to break destination index */
|
|
|
|
static struct strintmap *break_idx = NULL;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
|
2015-02-27 02:39:48 +01:00
|
|
|
{
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* Lookup by p->ONE->path */
|
|
|
|
int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;
|
|
|
|
return (idx == -1) ? NULL : &rename_dst[idx];
|
2015-02-27 02:39:48 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns 0 on success, -1 if we found a duplicate.
|
|
|
|
*/
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
static int add_rename_dst(struct diff_filepair *p)
|
2015-02-27 02:39:48 +01:00
|
|
|
{
|
2014-03-03 23:31:54 +01:00
|
|
|
ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
rename_dst[rename_dst_nr].p = p;
|
|
|
|
rename_dst[rename_dst_nr].filespec_to_free = NULL;
|
|
|
|
rename_dst[rename_dst_nr].is_rename = 0;
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_dst_nr++;
|
2015-02-27 02:39:48 +01:00
|
|
|
return 0;
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
|
2005-05-28 00:55:55 +02:00
|
|
|
/* Table of rename/copy src files */
|
2005-05-24 10:10:48 +02:00
|
|
|
static struct diff_rename_src {
|
2011-01-06 22:50:05 +01:00
|
|
|
struct diff_filepair *p;
|
2006-04-09 05:17:46 +02:00
|
|
|
unsigned short score; /* to remember the break score */
|
2005-05-24 10:10:48 +02:00
|
|
|
} *rename_src;
|
|
|
|
static int rename_src_nr, rename_src_alloc;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2020-12-11 10:08:46 +01:00
|
|
|
static void register_rename_src(struct diff_filepair *p)
|
2005-05-24 10:10:48 +02:00
|
|
|
{
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (p->broken_pair) {
|
|
|
|
if (!break_idx) {
|
|
|
|
break_idx = xmalloc(sizeof(*break_idx));
|
|
|
|
strintmap_init(break_idx, -1);
|
|
|
|
}
|
|
|
|
strintmap_set(break_idx, p->one->path, rename_dst_nr);
|
|
|
|
}
|
|
|
|
|
2014-03-03 23:31:54 +01:00
|
|
|
ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
|
2020-12-11 10:08:46 +01:00
|
|
|
rename_src[rename_src_nr].p = p;
|
|
|
|
rename_src[rename_src_nr].score = p->score;
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_src_nr++;
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
|
2007-06-21 13:52:11 +02:00
|
|
|
static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
|
|
|
|
{
|
|
|
|
int src_len = strlen(src->path), dst_len = strlen(dst->path);
|
|
|
|
while (src_len && dst_len) {
|
|
|
|
char c1 = src->path[--src_len];
|
|
|
|
char c2 = dst->path[--dst_len];
|
|
|
|
if (c1 != c2)
|
|
|
|
return 0;
|
|
|
|
if (c1 == '/')
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return (!src_len || src->path[src_len - 1] == '/') &&
|
|
|
|
(!dst_len || dst->path[dst_len - 1] == '/');
|
|
|
|
}
|
|
|
|
|
2005-05-21 11:39:09 +02:00
|
|
|
struct diff_score {
|
2005-05-24 10:10:48 +02:00
|
|
|
int src; /* index in rename_src */
|
|
|
|
int dst; /* index in rename_dst */
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
unsigned short score;
|
|
|
|
short name_score;
|
2005-05-21 11:39:09 +02:00
|
|
|
};
|
|
|
|
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
struct prefetch_options {
|
|
|
|
struct repository *repo;
|
|
|
|
int skip_unmodified;
|
|
|
|
};
|
|
|
|
static void prefetch(void *prefetch_options)
|
|
|
|
{
|
|
|
|
struct prefetch_options *options = prefetch_options;
|
|
|
|
int i;
|
|
|
|
struct oid_array to_fetch = OID_ARRAY_INIT;
|
|
|
|
|
|
|
|
for (i = 0; i < rename_dst_nr; i++) {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (rename_dst[i].p->renamed_pair)
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
/*
|
|
|
|
* The loop in diffcore_rename() will not need these
|
|
|
|
* blobs, so skip prefetching.
|
|
|
|
*/
|
|
|
|
continue; /* already found exact match */
|
|
|
|
diff_add_if_missing(options->repo, &to_fetch,
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
rename_dst[i].p->two);
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
}
|
|
|
|
for (i = 0; i < rename_src_nr; i++) {
|
|
|
|
if (options->skip_unmodified &&
|
|
|
|
diff_unmodified_pair(rename_src[i].p))
|
|
|
|
/*
|
|
|
|
* The loop in diffcore_rename() will not need these
|
|
|
|
* blobs, so skip prefetching.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
diff_add_if_missing(options->repo, &to_fetch,
|
|
|
|
rename_src[i].p->one);
|
|
|
|
}
|
|
|
|
promisor_remote_get_direct(options->repo, to_fetch.oid, to_fetch.nr);
|
|
|
|
oid_array_clear(&to_fetch);
|
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
static int estimate_similarity(struct repository *r,
|
|
|
|
struct diff_filespec *src,
|
2005-05-21 11:39:09 +02:00
|
|
|
struct diff_filespec *dst,
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
int minimum_score,
|
|
|
|
int skip_unmodified)
|
2005-05-21 11:39:09 +02:00
|
|
|
{
|
|
|
|
/* src points at a file that existed in the original tree (or
|
|
|
|
* optionally a file in the destination tree) and dst points
|
|
|
|
* at a newly created file. They may be quite similar, in which
|
|
|
|
* case we want to say src is renamed to dst or src is copied into
|
|
|
|
* dst, and then some edit has been applied to dst.
|
|
|
|
*
|
|
|
|
* Compare them and return how similar they are, representing
|
2005-05-28 00:56:38 +02:00
|
|
|
* the score as an integer between 0 and MAX_SCORE.
|
|
|
|
*
|
|
|
|
* When there is an exact match, it is considered a better
|
|
|
|
* match than anything else; the destination does not even
|
|
|
|
* call into this function in that case.
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
2006-03-13 07:26:34 +01:00
|
|
|
unsigned long max_size, delta_size, base_size, src_copied, literal_added;
|
2005-05-21 11:39:09 +02:00
|
|
|
int score;
|
2020-04-08 00:11:41 +02:00
|
|
|
struct diff_populate_filespec_options dpf_options = {
|
|
|
|
.check_size_only = 1
|
|
|
|
};
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
struct prefetch_options prefetch_options = {r, skip_unmodified};
|
|
|
|
|
|
|
|
if (r == the_repository && has_promisor_remote()) {
|
|
|
|
dpf_options.missing_object_cb = prefetch;
|
|
|
|
dpf_options.missing_object_data = &prefetch_options;
|
|
|
|
}
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-23 06:24:49 +02:00
|
|
|
/* We deal only with regular files. Symlink renames are handled
|
|
|
|
* only when they are exact matches --- in other words, no edits
|
|
|
|
* after renaming.
|
|
|
|
*/
|
|
|
|
if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
|
|
|
|
return 0;
|
|
|
|
|
2007-10-27 01:51:28 +02:00
|
|
|
/*
|
|
|
|
* Need to check that source and destination sizes are
|
|
|
|
* filled in before comparing them.
|
|
|
|
*
|
|
|
|
* If we already have "cnt_data" filled in, we know it's
|
|
|
|
* all good (avoid checking the size for zero, as that
|
|
|
|
* is a possible size - we really should have a flag to
|
|
|
|
* say whether the size is valid or not!)
|
|
|
|
*/
|
2014-08-16 05:08:04 +02:00
|
|
|
if (!src->cnt_data &&
|
2020-04-08 00:11:41 +02:00
|
|
|
diff_populate_filespec(r, src, &dpf_options))
|
2007-10-27 01:51:28 +02:00
|
|
|
return 0;
|
2014-08-16 05:08:04 +02:00
|
|
|
if (!dst->cnt_data &&
|
2020-04-08 00:11:41 +02:00
|
|
|
diff_populate_filespec(r, dst, &dpf_options))
|
2007-10-27 01:51:28 +02:00
|
|
|
return 0;
|
|
|
|
|
2006-03-13 07:26:34 +01:00
|
|
|
max_size = ((src->size > dst->size) ? src->size : dst->size);
|
2005-05-22 00:55:18 +02:00
|
|
|
base_size = ((src->size < dst->size) ? src->size : dst->size);
|
2006-03-13 07:26:34 +01:00
|
|
|
delta_size = max_size - base_size;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-22 00:55:18 +02:00
|
|
|
/* We would not consider edits that change the file size so
|
|
|
|
* drastically. delta_size must be smaller than
|
2005-05-22 10:31:28 +02:00
|
|
|
* (MAX_SCORE-minimum_score)/MAX_SCORE * min(src->size, dst->size).
|
2005-05-28 00:56:38 +02:00
|
|
|
*
|
2005-05-22 00:55:18 +02:00
|
|
|
* Note that base_size == 0 case is handled here already
|
|
|
|
* and the final score computation below would not have a
|
|
|
|
* divide-by-zero issue.
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
2011-02-19 05:12:06 +01:00
|
|
|
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
|
2005-05-21 11:39:09 +02:00
|
|
|
return 0;
|
|
|
|
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
dpf_options.check_size_only = 0;
|
|
|
|
|
|
|
|
if (!src->cnt_data && diff_populate_filespec(r, src, &dpf_options))
|
2009-01-20 16:59:57 +01:00
|
|
|
return 0;
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
if (!dst->cnt_data && diff_populate_filespec(r, dst, &dpf_options))
|
2009-01-20 16:59:57 +01:00
|
|
|
return 0;
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
if (diffcore_count_changes(r, src, dst,
|
2006-03-12 12:22:10 +01:00
|
|
|
&src->cnt_data, &dst->cnt_data,
|
2006-03-01 01:01:36 +01:00
|
|
|
&src_copied, &literal_added))
|
2005-05-24 21:09:32 +02:00
|
|
|
return 0;
|
2005-06-03 10:36:03 +02:00
|
|
|
|
2006-03-03 07:11:25 +01:00
|
|
|
/* How similar are they?
|
|
|
|
* what percentage of material in dst are from source?
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
2006-03-13 07:26:34 +01:00
|
|
|
if (!dst->size)
|
2006-03-03 07:11:25 +01:00
|
|
|
score = 0; /* should not happen */
|
2007-06-25 00:23:28 +02:00
|
|
|
else
|
2007-03-07 02:44:37 +01:00
|
|
|
score = (int)(src_copied * MAX_SCORE / max_size);
|
2005-05-21 11:39:09 +02:00
|
|
|
return score;
|
|
|
|
}
|
|
|
|
|
2005-09-16 01:13:43 +02:00
|
|
|
static void record_rename_pair(int dst_index, int src_index, int score)
|
2005-05-21 11:39:09 +02:00
|
|
|
{
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filepair *src = rename_src[src_index].p;
|
|
|
|
struct diff_filepair *dst = rename_dst[dst_index].p;
|
[PATCH] Rename/copy detection fix.
The rename/copy detection logic in earlier round was only good
enough to show patch output and discussion on the mailing list
about the diff-raw format updates revealed many problems with
it. This patch fixes all the ones known to me, without making
things I want to do later impossible, mostly related to patch
reordering.
(1) Earlier rename/copy detector determined which one is rename
and which one is copy too early, which made it impossible
to later introduce diffcore transformers to reorder
patches. This patch fixes it by moving that logic to the
very end of the processing.
(2) Earlier output routine diff_flush() was pruning all the
"no-change" entries indiscriminatingly. This was done due
to my false assumption that one of the requirements in the
diff-raw output was not to show such an entry (which
resulted in my incorrect comment about "diff-helper never
being able to be equivalent to built-in diff driver"). My
special thanks go to Linus for correcting me about this.
When we produce diff-raw output, for the downstream to be
able to tell renames from copies, sometimes it _is_
necessary to output "no-change" entries, and this patch
adds diffcore_prune() function for doing it.
(3) Earlier diff_filepair structure was trying to be not too
specific about rename/copy operations, but the purpose of
the structure was to record one or two paths, which _was_
indeed about rename/copy. This patch discards xfrm_msg
field which was trying to be generic for this wrong reason,
and introduces a couple of fields (rename_score and
rename_rank) that are explicitly specific to rename/copy
logic. One thing to note is that the information in a
single diff_filepair structure _still_ does not distinguish
renames from copies, and it is deliberately so. This is to
allow patches to be reordered in later stages.
(4) This patch also adds some tests about diff-raw format
output and makes sure that necessary "no-change" entries
appear on the output.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-23 06:26:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (dst->renamed_pair)
|
2005-05-24 10:10:48 +02:00
|
|
|
die("internal error: dst already matched.");
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
src->one->rename_used++;
|
|
|
|
src->one->count++;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
rename_dst[dst_index].filespec_to_free = dst->one;
|
|
|
|
rename_dst[dst_index].is_rename = 1;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
dst->one = src->one;
|
|
|
|
dst->renamed_pair = 1;
|
|
|
|
if (!strcmp(dst->one->path, dst->two->path))
|
|
|
|
dst->score = rename_src[src_index].score;
|
2006-04-09 05:17:46 +02:00
|
|
|
else
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
dst->score = score;
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We sort the rename similarity matrix with the score, in descending
|
2005-05-28 00:55:55 +02:00
|
|
|
* order (the most similar first).
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
|
|
|
static int score_compare(const void *a_, const void *b_)
|
|
|
|
{
|
|
|
|
const struct diff_score *a = a_, *b = b_;
|
2007-06-25 00:23:28 +02:00
|
|
|
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
/* sink the unused ones to the bottom */
|
|
|
|
if (a->dst < 0)
|
|
|
|
return (0 <= b->dst);
|
|
|
|
else if (b->dst < 0)
|
|
|
|
return -1;
|
|
|
|
|
2007-06-25 00:23:28 +02:00
|
|
|
if (a->score == b->score)
|
|
|
|
return b->name_score - a->name_score;
|
|
|
|
|
2005-05-21 11:39:09 +02:00
|
|
|
return b->score - a->score;
|
|
|
|
}
|
|
|
|
|
2007-10-25 20:23:26 +02:00
|
|
|
struct file_similarity {
|
2013-11-14 20:20:26 +01:00
|
|
|
struct hashmap_entry entry;
|
2013-11-14 20:19:34 +01:00
|
|
|
int index;
|
2007-10-25 20:23:26 +02:00
|
|
|
struct diff_filespec *filespec;
|
|
|
|
};
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
static unsigned int hash_filespec(struct repository *r,
|
|
|
|
struct diff_filespec *filespec)
|
2013-11-14 20:19:04 +01:00
|
|
|
{
|
2016-06-25 01:09:24 +02:00
|
|
|
if (!filespec->oid_valid) {
|
2020-04-08 00:11:41 +02:00
|
|
|
if (diff_populate_filespec(r, filespec, NULL))
|
2013-11-14 20:19:04 +01:00
|
|
|
return 0;
|
2020-01-30 21:32:22 +01:00
|
|
|
hash_object_file(r->hash_algo, filespec->data, filespec->size,
|
|
|
|
"blob", &filespec->oid);
|
2013-11-14 20:19:04 +01:00
|
|
|
}
|
2019-06-20 09:41:49 +02:00
|
|
|
return oidhash(&filespec->oid);
|
2013-11-14 20:19:04 +01:00
|
|
|
}
|
|
|
|
|
2013-11-14 20:20:26 +01:00
|
|
|
static int find_identical_files(struct hashmap *srcs,
|
2013-11-14 20:19:34 +01:00
|
|
|
int dst_index,
|
2011-02-19 04:55:19 +01:00
|
|
|
struct diff_options *options)
|
2007-10-25 20:23:26 +02:00
|
|
|
{
|
|
|
|
int renames = 0;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filespec *target = rename_dst[dst_index].p->two;
|
2014-07-03 00:22:11 +02:00
|
|
|
struct file_similarity *p, *best = NULL;
|
2013-11-14 20:19:04 +01:00
|
|
|
int i = 100, best_score = -1;
|
2019-10-07 01:30:35 +02:00
|
|
|
unsigned int hash = hash_filespec(options->repo, target);
|
2013-11-14 20:19:04 +01:00
|
|
|
|
|
|
|
/*
|
2013-11-14 20:19:34 +01:00
|
|
|
* Find the best source match for specified destination.
|
2013-11-14 20:19:04 +01:00
|
|
|
*/
|
2019-10-07 01:30:35 +02:00
|
|
|
p = hashmap_get_entry_from_hash(srcs, hash, NULL,
|
|
|
|
struct file_similarity, entry);
|
2019-10-07 01:30:41 +02:00
|
|
|
hashmap_for_each_entry_from(srcs, p, entry) {
|
2013-11-14 20:19:04 +01:00
|
|
|
int score;
|
|
|
|
struct diff_filespec *source = p->filespec;
|
|
|
|
|
|
|
|
/* False hash collision? */
|
2018-08-28 23:22:48 +02:00
|
|
|
if (!oideq(&source->oid, &target->oid))
|
2013-11-14 20:19:04 +01:00
|
|
|
continue;
|
|
|
|
/* Non-regular files? If so, the modes must match! */
|
|
|
|
if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
|
|
|
|
if (source->mode != target->mode)
|
2011-02-19 05:10:32 +01:00
|
|
|
continue;
|
2007-10-25 20:23:26 +02:00
|
|
|
}
|
2013-11-14 20:19:04 +01:00
|
|
|
/* Give higher scores to sources that haven't been used already */
|
|
|
|
score = !source->rename_used;
|
|
|
|
if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
|
|
|
|
continue;
|
|
|
|
score += basename_same(source, target);
|
|
|
|
if (score > best_score) {
|
|
|
|
best = p;
|
|
|
|
best_score = score;
|
|
|
|
if (score == 2)
|
|
|
|
break;
|
2007-10-25 20:23:26 +02:00
|
|
|
}
|
2013-11-14 20:19:04 +01:00
|
|
|
|
|
|
|
/* Too many identical alternatives? Pick one */
|
|
|
|
if (!--i)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (best) {
|
2013-11-14 20:19:34 +01:00
|
|
|
record_rename_pair(dst_index, best->index, MAX_SCORE);
|
2013-11-14 20:19:04 +01:00
|
|
|
renames++;
|
|
|
|
}
|
2007-10-25 20:23:26 +02:00
|
|
|
return renames;
|
|
|
|
}
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
static void insert_file_table(struct repository *r,
|
|
|
|
struct hashmap *table, int index,
|
|
|
|
struct diff_filespec *filespec)
|
2007-10-25 20:23:26 +02:00
|
|
|
{
|
|
|
|
struct file_similarity *entry = xmalloc(sizeof(*entry));
|
|
|
|
|
|
|
|
entry->index = index;
|
|
|
|
entry->filespec = filespec;
|
|
|
|
|
2019-10-07 01:30:27 +02:00
|
|
|
hashmap_entry_init(&entry->entry, hash_filespec(r, filespec));
|
2019-10-07 01:30:29 +02:00
|
|
|
hashmap_add(table, &entry->entry);
|
2007-10-25 20:23:26 +02:00
|
|
|
}
|
|
|
|
|
2007-10-25 20:17:55 +02:00
|
|
|
/*
|
|
|
|
* Find exact renames first.
|
|
|
|
*
|
|
|
|
* The first round matches up the up-to-date entries,
|
|
|
|
* and then during the second round we try to match
|
|
|
|
* cache-dirty entries as well.
|
|
|
|
*/
|
2011-02-19 04:55:19 +01:00
|
|
|
static int find_exact_renames(struct diff_options *options)
|
2007-10-25 20:17:55 +02:00
|
|
|
{
|
2013-11-14 20:19:34 +01:00
|
|
|
int i, renames = 0;
|
2013-11-14 20:20:26 +01:00
|
|
|
struct hashmap file_table;
|
2007-10-25 20:17:55 +02:00
|
|
|
|
diffcore: fix iteration order of identical files during rename detection
If the two paths 'dir/A/file' and 'dir/B/file' have identical content
and the parent directory is renamed, e.g. 'git mv dir other-dir', then
diffcore reports the following exact renames:
renamed: dir/B/file -> other-dir/A/file
renamed: dir/A/file -> other-dir/B/file
While technically not wrong, this is confusing not only for the user,
but also for git commands that make decisions based on rename
information, e.g. 'git log --follow other-dir/A/file' follows
'dir/B/file' past the rename.
This behavior is a side effect of commit v2.0.0-rc4~8^2~14
(diffcore-rename.c: simplify finding exact renames, 2013-11-14): the
hashmap storing sources returns entries from the same bucket, i.e.
sources matching the current destination, in LIFO order. Thus the
iteration first examines 'other-dir/A/file' and 'dir/B/file' and, upon
finding identical content and basename, reports an exact rename.
Other hashmap users are apparently happy with the current iteration
order over the entries of a bucket. Changing the iteration order
would risk upsetting other hashmap users and would increase the memory
footprint of each bucket by a pointer to the tail element.
Fill the hashmap with source entries in reverse order to restore the
original exact rename detection behavior.
Reported-by: Bill Okara <billokara@gmail.com>
Signed-off-by: SZEDER Gábor <szeder@ira.uka.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-03-30 10:35:07 +02:00
|
|
|
/* Add all sources to the hash table in reverse order, because
|
|
|
|
* later on they will be retrieved in LIFO order.
|
|
|
|
*/
|
2017-06-30 21:14:05 +02:00
|
|
|
hashmap_init(&file_table, NULL, NULL, rename_src_nr);
|
diffcore: fix iteration order of identical files during rename detection
If the two paths 'dir/A/file' and 'dir/B/file' have identical content
and the parent directory is renamed, e.g. 'git mv dir other-dir', then
diffcore reports the following exact renames:
renamed: dir/B/file -> other-dir/A/file
renamed: dir/A/file -> other-dir/B/file
While technically not wrong, this is confusing not only for the user,
but also for git commands that make decisions based on rename
information, e.g. 'git log --follow other-dir/A/file' follows
'dir/B/file' past the rename.
This behavior is a side effect of commit v2.0.0-rc4~8^2~14
(diffcore-rename.c: simplify finding exact renames, 2013-11-14): the
hashmap storing sources returns entries from the same bucket, i.e.
sources matching the current destination, in LIFO order. Thus the
iteration first examines 'other-dir/A/file' and 'dir/B/file' and, upon
finding identical content and basename, reports an exact rename.
Other hashmap users are apparently happy with the current iteration
order over the entries of a bucket. Changing the iteration order
would risk upsetting other hashmap users and would increase the memory
footprint of each bucket by a pointer to the tail element.
Fill the hashmap with source entries in reverse order to restore the
original exact rename detection behavior.
Reported-by: Bill Okara <billokara@gmail.com>
Signed-off-by: SZEDER Gábor <szeder@ira.uka.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-03-30 10:35:07 +02:00
|
|
|
for (i = rename_src_nr-1; i >= 0; i--)
|
2018-09-21 17:57:19 +02:00
|
|
|
insert_file_table(options->repo,
|
|
|
|
&file_table, i,
|
|
|
|
rename_src[i].p->one);
|
2007-10-25 20:23:26 +02:00
|
|
|
|
2013-11-14 20:19:34 +01:00
|
|
|
/* Walk the destinations and find best source match */
|
2007-10-25 20:23:26 +02:00
|
|
|
for (i = 0; i < rename_dst_nr; i++)
|
2013-11-14 20:19:34 +01:00
|
|
|
renames += find_identical_files(&file_table, i, options);
|
2007-10-25 20:23:26 +02:00
|
|
|
|
2013-11-14 20:20:26 +01:00
|
|
|
/* Free the hash data structure and entries */
|
2020-11-02 19:55:05 +01:00
|
|
|
hashmap_clear_and_free(&file_table, struct file_similarity, entry);
|
2007-10-25 20:23:26 +02:00
|
|
|
|
2013-11-14 20:19:34 +01:00
|
|
|
return renames;
|
2007-10-25 20:17:55 +02:00
|
|
|
}
|
|
|
|
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
static const char *get_basename(const char *filename)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* gitbasename() has to worry about special drives, multiple
|
|
|
|
* directory separator characters, trailing slashes, NULL or
|
|
|
|
* empty strings, etc. We only work on filenames as stored in
|
|
|
|
* git, and thus get to ignore all those complications.
|
|
|
|
*/
|
|
|
|
const char *base = strrchr(filename, '/');
|
|
|
|
return base ? base + 1 : filename;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int find_basename_matches(struct diff_options *options,
|
|
|
|
int minimum_score)
|
|
|
|
{
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
/*
|
|
|
|
* When I checked in early 2020, over 76% of file renames in linux
|
|
|
|
* just moved files to a different directory but kept the same
|
|
|
|
* basename. gcc did that with over 64% of renames, gecko did it
|
|
|
|
* with over 79%, and WebKit did it with over 89%.
|
|
|
|
*
|
|
|
|
* Therefore we can bypass the normal exhaustive NxM matrix
|
|
|
|
* comparison of similarities between all potential rename sources
|
|
|
|
* and destinations by instead using file basename as a hint (i.e.
|
|
|
|
* the portion of the filename after the last '/'), checking for
|
|
|
|
* similarity between files with the same basename, and if we find
|
|
|
|
* a pair that are sufficiently similar, record the rename pair and
|
|
|
|
* exclude those two from the NxM matrix.
|
|
|
|
*
|
|
|
|
* This *might* cause us to find a less than optimal pairing (if
|
|
|
|
* there is another file that we are even more similar to but has a
|
|
|
|
* different basename). Given the huge performance advantage
|
|
|
|
* basename matching provides, and given the frequency with which
|
|
|
|
* people use the same basename in real world projects, that's a
|
|
|
|
* trade-off we are willing to accept when doing just rename
|
|
|
|
* detection.
|
|
|
|
*
|
|
|
|
* If someone wants copy detection that implies they are willing to
|
|
|
|
* spend more cycles to find similarities between files, so it may
|
|
|
|
* be less likely that this heuristic is wanted. If someone is
|
|
|
|
* doing break detection, that means they do not want filename
|
|
|
|
* similarity to imply any form of content similiarity, and thus
|
|
|
|
* this heuristic would definitely be incompatible.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int i, renames = 0;
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
struct strintmap sources;
|
|
|
|
struct strintmap dests;
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct strmap_entry *entry;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The prefeteching stuff wants to know if it can skip prefetching
|
|
|
|
* blobs that are unmodified...and will then do a little extra work
|
|
|
|
* to verify that the oids are indeed different before prefetching.
|
|
|
|
* Unmodified blobs are only relevant when doing copy detection;
|
|
|
|
* when limiting to rename detection, diffcore_rename[_extended]()
|
|
|
|
* will never be called with unmodified source paths fed to us, so
|
|
|
|
* the extra work necessary to check if rename_src entries are
|
|
|
|
* unmodified would be a small waste.
|
|
|
|
*/
|
|
|
|
int skip_unmodified = 0;
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create maps of basename -> fullname(s) for remaining sources and
|
|
|
|
* dests.
|
|
|
|
*/
|
|
|
|
strintmap_init_with_options(&sources, -1, NULL, 0);
|
|
|
|
strintmap_init_with_options(&dests, -1, NULL, 0);
|
|
|
|
for (i = 0; i < rename_src_nr; ++i) {
|
|
|
|
char *filename = rename_src[i].p->one->path;
|
|
|
|
const char *base;
|
|
|
|
|
|
|
|
/* exact renames removed in remove_unneeded_paths_from_src() */
|
|
|
|
assert(!rename_src[i].p->one->rename_used);
|
|
|
|
|
|
|
|
/* Record index within rename_src (i) if basename is unique */
|
|
|
|
base = get_basename(filename);
|
|
|
|
if (strintmap_contains(&sources, base))
|
|
|
|
strintmap_set(&sources, base, -1);
|
|
|
|
else
|
|
|
|
strintmap_set(&sources, base, i);
|
|
|
|
}
|
|
|
|
for (i = 0; i < rename_dst_nr; ++i) {
|
|
|
|
char *filename = rename_dst[i].p->two->path;
|
|
|
|
const char *base;
|
|
|
|
|
|
|
|
if (rename_dst[i].is_rename)
|
|
|
|
continue; /* involved in exact match already. */
|
|
|
|
|
|
|
|
/* Record index within rename_dst (i) if basename is unique */
|
|
|
|
base = get_basename(filename);
|
|
|
|
if (strintmap_contains(&dests, base))
|
|
|
|
strintmap_set(&dests, base, -1);
|
|
|
|
else
|
|
|
|
strintmap_set(&dests, base, i);
|
|
|
|
}
|
|
|
|
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
/* Now look for basename matchups and do similarity estimation */
|
|
|
|
strintmap_for_each_entry(&sources, &iter, entry) {
|
|
|
|
const char *base = entry->key;
|
|
|
|
intptr_t src_index = (intptr_t)entry->value;
|
|
|
|
intptr_t dst_index;
|
|
|
|
if (src_index == -1)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (0 <= (dst_index = strintmap_get(&dests, base))) {
|
|
|
|
struct diff_filespec *one, *two;
|
|
|
|
int score;
|
|
|
|
|
|
|
|
/* Estimate the similarity */
|
|
|
|
one = rename_src[src_index].p->one;
|
|
|
|
two = rename_dst[dst_index].p->two;
|
|
|
|
score = estimate_similarity(options->repo, one, two,
|
|
|
|
minimum_score, skip_unmodified);
|
|
|
|
|
|
|
|
/* If sufficiently similar, record as rename pair */
|
|
|
|
if (score < minimum_score)
|
|
|
|
continue;
|
|
|
|
record_rename_pair(dst_index, src_index, score);
|
|
|
|
renames++;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Found a rename so don't need text anymore; if we
|
|
|
|
* didn't find a rename, the filespec_blob would get
|
|
|
|
* re-used when doing the matrix of comparisons.
|
|
|
|
*/
|
|
|
|
diff_free_filespec_blob(one);
|
|
|
|
diff_free_filespec_blob(two);
|
|
|
|
}
|
|
|
|
}
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
|
|
|
|
strintmap_clear(&sources);
|
|
|
|
strintmap_clear(&dests);
|
|
|
|
|
diffcore-rename: complete find_basename_matches()
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories. We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames. If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file. For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.
This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose. The next
commit will do the necessary plumbing to make use of it.
Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename. I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches. Reason (4) takes a full paragraph to
explain...
If the previous three reasons aren't enough, consider what rename
detection already does. Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar. In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity. Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents. This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well. Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:48 +01:00
|
|
|
return renames;
|
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:47 +01:00
|
|
|
}
|
|
|
|
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
#define NUM_CANDIDATE_PER_DST 4
|
|
|
|
static void record_if_better(struct diff_score m[], struct diff_score *o)
|
|
|
|
{
|
|
|
|
int i, worst;
|
|
|
|
|
|
|
|
/* find the worst one */
|
|
|
|
worst = 0;
|
|
|
|
for (i = 1; i < NUM_CANDIDATE_PER_DST; i++)
|
|
|
|
if (score_compare(&m[i], &m[worst]) > 0)
|
|
|
|
worst = i;
|
|
|
|
|
|
|
|
/* is it better than the worst one? */
|
|
|
|
if (score_compare(&m[worst], o) > 0)
|
|
|
|
m[worst] = *o;
|
|
|
|
}
|
|
|
|
|
2011-01-06 22:50:06 +01:00
|
|
|
/*
|
|
|
|
* Returns:
|
|
|
|
* 0 if we are under the limit;
|
|
|
|
* 1 if we need to disable inexact rename detection;
|
|
|
|
* 2 if we would be under the limit if we were given -C instead of -C -C.
|
|
|
|
*/
|
2020-12-11 10:08:41 +01:00
|
|
|
static int too_many_rename_candidates(int num_destinations, int num_sources,
|
2011-01-06 22:50:04 +01:00
|
|
|
struct diff_options *options)
|
|
|
|
{
|
|
|
|
int rename_limit = options->rename_limit;
|
2020-12-11 10:08:41 +01:00
|
|
|
int i, limited_sources;
|
2011-01-06 22:50:04 +01:00
|
|
|
|
|
|
|
options->needed_rename_limit = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This basically does a test for the rename matrix not
|
|
|
|
* growing larger than a "rename_limit" square matrix, ie:
|
|
|
|
*
|
2020-12-11 10:08:41 +01:00
|
|
|
* num_destinations * num_sources > rename_limit * rename_limit
|
2020-12-11 10:08:42 +01:00
|
|
|
*
|
|
|
|
* We use st_mult() to check overflow conditions; in the
|
|
|
|
* exceptional circumstance that size_t isn't large enough to hold
|
|
|
|
* the multiplication, the system won't be able to allocate enough
|
|
|
|
* memory for the matrix anyway.
|
2011-01-06 22:50:04 +01:00
|
|
|
*/
|
2017-11-29 21:11:54 +01:00
|
|
|
if (rename_limit <= 0)
|
|
|
|
rename_limit = 32767;
|
2020-12-11 10:08:42 +01:00
|
|
|
if (st_mult(num_destinations, num_sources)
|
|
|
|
<= st_mult(rename_limit, rename_limit))
|
2011-01-06 22:50:04 +01:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
options->needed_rename_limit =
|
2020-12-11 10:08:41 +01:00
|
|
|
num_sources > num_destinations ? num_sources : num_destinations;
|
2011-01-06 22:50:06 +01:00
|
|
|
|
|
|
|
/* Are we running under -C -C? */
|
2017-10-31 19:19:11 +01:00
|
|
|
if (!options->flags.find_copies_harder)
|
2011-01-06 22:50:06 +01:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* Would we bust the limit if we were running under -C? */
|
2020-12-11 10:08:41 +01:00
|
|
|
for (limited_sources = i = 0; i < num_sources; i++) {
|
2011-01-06 22:50:06 +01:00
|
|
|
if (diff_unmodified_pair(rename_src[i].p))
|
|
|
|
continue;
|
2020-12-11 10:08:41 +01:00
|
|
|
limited_sources++;
|
2011-01-06 22:50:06 +01:00
|
|
|
}
|
2020-12-11 10:08:42 +01:00
|
|
|
if (st_mult(num_destinations, limited_sources)
|
|
|
|
<= st_mult(rename_limit, rename_limit))
|
2011-01-06 22:50:06 +01:00
|
|
|
return 2;
|
2011-01-06 22:50:04 +01:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2011-02-19 05:10:32 +01:00
|
|
|
static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, int copies)
|
|
|
|
{
|
|
|
|
int count = 0, i;
|
|
|
|
|
|
|
|
for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {
|
|
|
|
struct diff_rename_dst *dst;
|
|
|
|
|
|
|
|
if ((mx[i].dst < 0) ||
|
|
|
|
(mx[i].score < minimum_score))
|
|
|
|
break; /* there is no more usable pair. */
|
|
|
|
dst = &rename_dst[mx[i].dst];
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (dst->is_rename)
|
2011-02-19 05:10:32 +01:00
|
|
|
continue; /* already done, either exact or fuzzy. */
|
2011-01-06 22:50:05 +01:00
|
|
|
if (!copies && rename_src[mx[i].src].p->one->rename_used)
|
2011-02-19 05:10:32 +01:00
|
|
|
continue;
|
|
|
|
record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2021-02-14 08:35:01 +01:00
|
|
|
static void remove_unneeded_paths_from_src(int detecting_copies)
|
|
|
|
{
|
|
|
|
int i, new_num_src;
|
|
|
|
|
|
|
|
if (detecting_copies)
|
|
|
|
return; /* nothing to remove */
|
|
|
|
if (break_idx)
|
|
|
|
return; /* culling incompatible with break detection */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note on reasons why we cull unneeded sources but not destinations:
|
|
|
|
* 1) Pairings are stored in rename_dst (not rename_src), which we
|
|
|
|
* need to keep around. So, we just can't cull rename_dst even
|
|
|
|
* if we wanted to. But doing so wouldn't help because...
|
|
|
|
*
|
|
|
|
* 2) There is a matrix pairwise comparison that follows the
|
|
|
|
* "Performing inexact rename detection" progress message.
|
|
|
|
* Iterating over the destinations is done in the outer loop,
|
|
|
|
* hence we only iterate over each of those once and we can
|
|
|
|
* easily skip the outer loop early if the destination isn't
|
|
|
|
* relevant. That's only one check per destination path to
|
|
|
|
* skip.
|
|
|
|
*
|
|
|
|
* By contrast, the sources are iterated in the inner loop; if
|
|
|
|
* we check whether a source can be skipped, then we'll be
|
|
|
|
* checking it N separate times, once for each destination.
|
|
|
|
* We don't want to have to iterate over known-not-needed
|
|
|
|
* sources N times each, so avoid that by removing the sources
|
|
|
|
* from rename_src here.
|
|
|
|
*/
|
|
|
|
for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
|
|
|
|
/*
|
|
|
|
* renames are stored in rename_dst, so if a rename has
|
|
|
|
* already been detected using this source, we can just
|
|
|
|
* remove the source knowing rename_dst has its info.
|
|
|
|
*/
|
|
|
|
if (rename_src[i].p->one->rename_used)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (new_num_src < i)
|
|
|
|
memcpy(&rename_src[new_num_src], &rename_src[i],
|
|
|
|
sizeof(struct diff_rename_src));
|
|
|
|
new_num_src++;
|
|
|
|
}
|
|
|
|
|
|
|
|
rename_src_nr = new_num_src;
|
|
|
|
}
|
|
|
|
|
2005-09-21 09:18:27 +02:00
|
|
|
void diffcore_rename(struct diff_options *options)
|
2005-05-21 11:39:09 +02:00
|
|
|
{
|
2005-09-21 09:18:27 +02:00
|
|
|
int detect_rename = options->detect_rename;
|
|
|
|
int minimum_score = options->rename_score;
|
2005-05-22 04:40:36 +02:00
|
|
|
struct diff_queue_struct *q = &diff_queued_diff;
|
2005-09-16 01:13:43 +02:00
|
|
|
struct diff_queue_struct outq;
|
2005-05-21 11:39:09 +02:00
|
|
|
struct diff_score *mx;
|
2011-01-06 22:50:06 +01:00
|
|
|
int i, j, rename_count, skip_unmodified = 0;
|
2020-12-11 10:08:40 +01:00
|
|
|
int num_destinations, dst_cnt;
|
2021-02-03 21:03:46 +01:00
|
|
|
int num_sources, want_copies;
|
2011-02-20 10:51:16 +01:00
|
|
|
struct progress *progress = NULL;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "setup", options->repo);
|
2021-02-03 21:03:46 +01:00
|
|
|
want_copies = (detect_rename == DIFF_DETECT_COPY);
|
2005-05-22 08:33:32 +02:00
|
|
|
if (!minimum_score)
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
minimum_score = DEFAULT_RENAME_SCORE;
|
2005-05-21 11:39:09 +02:00
|
|
|
|
|
|
|
for (i = 0; i < q->nr; i++) {
|
2005-05-21 11:40:01 +02:00
|
|
|
struct diff_filepair *p = q->queue[i];
|
2006-11-02 09:02:11 +01:00
|
|
|
if (!DIFF_FILE_VALID(p->one)) {
|
2005-05-22 04:42:18 +02:00
|
|
|
if (!DIFF_FILE_VALID(p->two))
|
2005-05-23 06:24:49 +02:00
|
|
|
continue; /* unmerged */
|
2006-11-02 09:02:11 +01:00
|
|
|
else if (options->single_follow &&
|
|
|
|
strcmp(options->single_follow, p->two->path))
|
|
|
|
continue; /* not interested */
|
2017-10-31 19:19:11 +01:00
|
|
|
else if (!options->flags.rename_empty &&
|
2017-05-30 19:31:08 +02:00
|
|
|
is_empty_blob_oid(&p->two->oid))
|
teach diffcore-rename to optionally ignore empty content
Our rename detection is a heuristic, matching pairs of
removed and added files with similar or identical content.
It's unlikely to be wrong when there is actual content to
compare, and we already take care not to do inexact rename
detection when there is not enough content to produce good
results.
However, we always do exact rename detection, even when the
blob is tiny or empty. It's easy to get false positives with
an empty blob, simply because it is an obvious content to
use as a boilerplate (e.g., when telling git that an empty
directory is worth tracking via an empty .gitignore).
This patch lets callers specify whether or not they are
interested in using empty files as rename sources and
destinations. The default is "yes", keeping the original
behavior. It works by detecting the empty-blob sha1 for
rename sources and destinations.
One more flexible alternative would be to allow the caller
to specify a minimum size for a blob to be "interesting" for
rename detection. But that would catch small boilerplate
files, not large ones (e.g., if you had the GPL COPYING file
in many directories).
A better alternative would be to allow a "-rename"
gitattribute to allow boilerplate files to be marked as
such. I'll leave the complexity of that solution until such
time as somebody actually wants it. The complaints we've
seen so far revolve around empty files, so let's start with
the simple thing.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-03-22 23:52:13 +01:00
|
|
|
continue;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
else if (add_rename_dst(p) < 0) {
|
2015-02-27 02:42:27 +01:00
|
|
|
warning("skipping rename detection, detected"
|
|
|
|
" duplicate destination '%s'",
|
|
|
|
p->two->path);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
2006-11-02 09:02:11 +01:00
|
|
|
}
|
2017-10-31 19:19:11 +01:00
|
|
|
else if (!options->flags.rename_empty &&
|
2017-05-30 19:31:08 +02:00
|
|
|
is_empty_blob_oid(&p->one->oid))
|
teach diffcore-rename to optionally ignore empty content
Our rename detection is a heuristic, matching pairs of
removed and added files with similar or identical content.
It's unlikely to be wrong when there is actual content to
compare, and we already take care not to do inexact rename
detection when there is not enough content to produce good
results.
However, we always do exact rename detection, even when the
blob is tiny or empty. It's easy to get false positives with
an empty blob, simply because it is an obvious content to
use as a boilerplate (e.g., when telling git that an empty
directory is worth tracking via an empty .gitignore).
This patch lets callers specify whether or not they are
interested in using empty files as rename sources and
destinations. The default is "yes", keeping the original
behavior. It works by detecting the empty-blob sha1 for
rename sources and destinations.
One more flexible alternative would be to allow the caller
to specify a minimum size for a blob to be "interesting" for
rename detection. But that would catch small boilerplate
files, not large ones (e.g., if you had the GPL COPYING file
in many directories).
A better alternative would be to allow a "-rename"
gitattribute to allow boilerplate files to be marked as
such. I'll leave the complexity of that solution until such
time as somebody actually wants it. The complaints we've
seen so far revolve around empty files, so let's start with
the simple thing.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-03-22 23:52:13 +01:00
|
|
|
continue;
|
diffcore-rename: don't consider unmerged path as source
Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for
unmerged entries., 2007-01-05), an unmerged entry should be detected by
using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of
the filepair records mode=0 entries. However, it forgot to update some
parts of the rename detection logic.
This only makes difference in the "diff --cached" codepath where an
unmerged filepair carries information on the entries that came from the
tree. It probably hasn't been noticed for a long time because nobody
would run "diff -M" during a conflict resolution, but "git status" uses
rename detection when it internally runs "diff-index" and "diff-files"
and gives nonsense results.
In an unmerged pair, "one" side can have a valid filespec to record the
tree entry (e.g. what's in HEAD) when running "diff --cached". This can
be used as a rename source to other paths in the index that are not
unmerged. The path that is unmerged by definition does not have the
final content yet (i.e. "two" side cannot have a valid filespec), so it
can never be a rename destination.
Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and
allow the valid "one" side of an unmerged filepair to be considered a
potential rename source, but never to be considered a rename destination.
Commit message and first two test cases by Junio, the rest by Martin.
Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-03-24 03:41:01 +01:00
|
|
|
else if (!DIFF_PAIR_UNMERGED(p) && !DIFF_FILE_VALID(p->two)) {
|
2007-10-25 20:20:56 +02:00
|
|
|
/*
|
|
|
|
* If the source is a broken "delete", and
|
2005-06-12 05:55:20 +02:00
|
|
|
* they did not really want to get broken,
|
|
|
|
* that means the source actually stays.
|
2007-10-25 20:20:56 +02:00
|
|
|
* So we increment the "rename_used" score
|
|
|
|
* by one, to indicate ourselves as a user
|
|
|
|
*/
|
|
|
|
if (p->broken_pair && !p->score)
|
|
|
|
p->one->rename_used++;
|
2011-01-06 22:50:05 +01:00
|
|
|
register_rename_src(p);
|
2007-10-25 20:20:56 +02:00
|
|
|
}
|
2021-02-03 21:03:46 +01:00
|
|
|
else if (want_copies) {
|
2007-10-25 20:20:56 +02:00
|
|
|
/*
|
|
|
|
* Increment the "rename_used" score by
|
|
|
|
* one, to indicate ourselves as a user.
|
2005-06-12 05:55:20 +02:00
|
|
|
*/
|
2007-10-25 20:20:56 +02:00
|
|
|
p->one->rename_used++;
|
2011-01-06 22:50:05 +01:00
|
|
|
register_rename_src(p);
|
2005-06-12 05:55:20 +02:00
|
|
|
}
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "setup", options->repo);
|
Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 19:39:48 +02:00
|
|
|
if (rename_dst_nr == 0 || rename_src_nr == 0)
|
2005-05-21 11:39:09 +02:00
|
|
|
goto cleanup; /* nothing to do */
|
|
|
|
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "exact renames", options->repo);
|
Do exact rename detection regardless of rename limits
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.
In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.
With the new code, that resulted in
[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
400006
real 0m2.071s
user 0m1.520s
sys 0m0.576s
ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:
diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
similarity index 100%
rename from really-big-dir/file-1-1-1-1-1
rename to moved-big-dir/file-1-1-1-1-1
and the extra six lines is from a one-liner commit message and all the
commit information and spacing).
Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.
With the old code, this wouldn't have been practically possible. Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive. In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:
[torvalds@woody big-rename]$ time git show -C | wc -l
1400006
real 0m12.337s
user 0m12.285s
sys 0m0.192s
ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-25 20:24:47 +02:00
|
|
|
/*
|
|
|
|
* We really want to cull the candidates list early
|
|
|
|
* with cheap tests in order to avoid doing deltas.
|
|
|
|
*/
|
2011-02-19 04:55:19 +01:00
|
|
|
rename_count = find_exact_renames(options);
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "exact renames", options->repo);
|
Do exact rename detection regardless of rename limits
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.
In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.
With the new code, that resulted in
[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
400006
real 0m2.071s
user 0m1.520s
sys 0m0.576s
ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:
diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
similarity index 100%
rename from really-big-dir/file-1-1-1-1-1
rename to moved-big-dir/file-1-1-1-1-1
and the extra six lines is from a one-liner commit message and all the
commit information and spacing).
Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.
With the old code, this wouldn't have been practically possible. Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive. In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:
[torvalds@woody big-rename]$ time git show -C | wc -l
1400006
real 0m12.337s
user 0m12.285s
sys 0m0.192s
ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-25 20:24:47 +02:00
|
|
|
|
2007-10-27 01:56:34 +02:00
|
|
|
/* Did we only want exact renames? */
|
|
|
|
if (minimum_score == MAX_SCORE)
|
|
|
|
goto cleanup;
|
|
|
|
|
2021-02-03 21:03:46 +01:00
|
|
|
num_sources = rename_src_nr;
|
2007-10-27 01:56:34 +02:00
|
|
|
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
if (want_copies || break_idx) {
|
|
|
|
/*
|
|
|
|
* Cull sources:
|
|
|
|
* - remove ones corresponding to exact renames
|
|
|
|
*/
|
|
|
|
trace2_region_enter("diff", "cull after exact", options->repo);
|
|
|
|
remove_unneeded_paths_from_src(want_copies);
|
|
|
|
trace2_region_leave("diff", "cull after exact", options->repo);
|
|
|
|
} else {
|
|
|
|
/* Determine minimum score to match basenames */
|
|
|
|
double factor = 0.5;
|
|
|
|
char *basename_factor = getenv("GIT_BASENAME_FACTOR");
|
|
|
|
int min_basename_score;
|
|
|
|
|
|
|
|
if (basename_factor)
|
|
|
|
factor = strtol(basename_factor, NULL, 10)/100.0;
|
|
|
|
assert(factor >= 0.0 && factor <= 1.0);
|
|
|
|
min_basename_score = minimum_score +
|
|
|
|
(int)(factor * (MAX_SCORE - minimum_score));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cull sources:
|
|
|
|
* - remove ones involved in renames (found via exact match)
|
|
|
|
*/
|
|
|
|
trace2_region_enter("diff", "cull after exact", options->repo);
|
|
|
|
remove_unneeded_paths_from_src(want_copies);
|
|
|
|
trace2_region_leave("diff", "cull after exact", options->repo);
|
|
|
|
|
|
|
|
/* Utilize file basenames to quickly find renames. */
|
|
|
|
trace2_region_enter("diff", "basename matches", options->repo);
|
|
|
|
rename_count += find_basename_matches(options,
|
|
|
|
min_basename_score);
|
|
|
|
trace2_region_leave("diff", "basename matches", options->repo);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cull sources, again:
|
|
|
|
* - remove ones involved in renames (found via basenames)
|
|
|
|
*/
|
|
|
|
trace2_region_enter("diff", "cull basename", options->repo);
|
|
|
|
remove_unneeded_paths_from_src(want_copies);
|
|
|
|
trace2_region_leave("diff", "cull basename", options->repo);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Calculate how many rename destinations are left */
|
|
|
|
num_destinations = (rename_dst_nr - rename_count);
|
|
|
|
num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
|
|
|
|
|
2007-10-27 01:56:34 +02:00
|
|
|
/* All done? */
|
2021-02-03 21:03:46 +01:00
|
|
|
if (!num_destinations || !num_sources)
|
2007-10-27 01:56:34 +02:00
|
|
|
goto cleanup;
|
|
|
|
|
2021-02-03 21:03:46 +01:00
|
|
|
switch (too_many_rename_candidates(num_destinations, num_sources,
|
2020-12-11 10:08:41 +01:00
|
|
|
options)) {
|
2011-01-06 22:50:06 +01:00
|
|
|
case 1:
|
Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 19:39:48 +02:00
|
|
|
goto cleanup;
|
2011-01-06 22:50:06 +01:00
|
|
|
case 2:
|
|
|
|
options->degraded_cc_to_c = 1;
|
|
|
|
skip_unmodified = 1;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
Fix the rename detection limit checking
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-14 19:39:48 +02:00
|
|
|
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "inexact renames", options->repo);
|
2011-02-20 10:51:16 +01:00
|
|
|
if (options->show_rename_progress) {
|
progress: simplify "delayed" progress API
We used to expose the full power of the delayed progress API to the
callers, so that they can specify, not just the message to show and
expected total amount of work that is used to compute the percentage
of work performed so far, the percent-threshold parameter P and the
delay-seconds parameter N. The progress meter starts to show at N
seconds into the operation only if we have not yet completed P per-cent
of the total work.
Most callers used either (0%, 2s) or (50%, 1s) as (P, N), but there
are oddballs that chose more random-looking values like 95%.
For a smoother workload, (50%, 1s) would allow us to start showing
the progress meter earlier than (0%, 2s), while keeping the chance
of not showing progress meter for long running operation the same as
the latter. For a task that would take 2s or more to complete, it
is likely that less than half of it would complete within the first
second, if the workload is smooth. But for a spiky workload whose
earlier part is easier, such a setting is likely to fail to show the
progress meter entirely and (0%, 2s) is more appropriate.
But that is merely a theory. Realistically, it is of dubious value
to ask each codepath to carefully consider smoothness of their
workload and specify their own setting by passing two extra
parameters. Let's simplify the API by dropping both parameters and
have everybody use (0%, 2s).
Oh, by the way, the percent-threshold parameter and the structure
member were consistently misspelled, which also is now fixed ;-)
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-19 19:39:41 +02:00
|
|
|
progress = start_delayed_progress(
|
2014-02-21 13:50:18 +01:00
|
|
|
_("Performing inexact rename detection"),
|
2021-02-03 21:03:46 +01:00
|
|
|
(uint64_t)num_destinations * (uint64_t)num_sources);
|
2011-02-20 10:51:16 +01:00
|
|
|
}
|
|
|
|
|
2020-12-11 10:08:40 +01:00
|
|
|
mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
|
|
|
|
sizeof(*mx));
|
2005-05-24 10:10:48 +02:00
|
|
|
for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_filespec *two = rename_dst[i].p->two;
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
struct diff_score *m;
|
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (rename_dst[i].is_rename)
|
diffcore-rename: guide inexact rename detection based on basenames
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames. As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files. This is a behavioral change, as
covered in more detail in the previous commit message.
We do not use this heuristic together with either break or copy
detection. The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity. The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities. So the idea behind this
optimization goes against both of those features, and will be turned off
for both.
For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:
Before After
no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s
mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s
just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-14 08:51:49 +01:00
|
|
|
continue; /* exact or basename match already handled */
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
|
|
|
|
m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
|
|
|
|
for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
|
|
|
|
m[j].dst = -1;
|
|
|
|
|
2005-05-24 10:10:48 +02:00
|
|
|
for (j = 0; j < rename_src_nr; j++) {
|
2011-01-06 22:50:05 +01:00
|
|
|
struct diff_filespec *one = rename_src[j].p->one;
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
struct diff_score this_src;
|
2011-01-06 22:50:06 +01:00
|
|
|
|
2021-02-14 08:35:01 +01:00
|
|
|
assert(!one->rename_used || want_copies || break_idx);
|
2021-02-03 21:03:46 +01:00
|
|
|
|
2011-01-06 22:50:06 +01:00
|
|
|
if (skip_unmodified &&
|
|
|
|
diff_unmodified_pair(rename_src[j].p))
|
|
|
|
continue;
|
|
|
|
|
2018-09-21 17:57:19 +02:00
|
|
|
this_src.score = estimate_similarity(options->repo,
|
|
|
|
one, two,
|
diff: restrict when prefetching occurs
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-08 00:11:43 +02:00
|
|
|
minimum_score,
|
|
|
|
skip_unmodified);
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
this_src.name_score = basename_same(one, two);
|
|
|
|
this_src.dst = i;
|
|
|
|
this_src.src = j;
|
|
|
|
record_if_better(m, &this_src);
|
2009-11-21 07:13:47 +01:00
|
|
|
/*
|
|
|
|
* Once we run estimate_similarity,
|
|
|
|
* We do not need the text anymore.
|
|
|
|
*/
|
2007-10-03 06:01:03 +02:00
|
|
|
diff_free_filespec_blob(one);
|
2009-11-21 07:13:47 +01:00
|
|
|
diff_free_filespec_blob(two);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
|
|
|
dst_cnt++;
|
2020-12-11 10:08:43 +01:00
|
|
|
display_progress(progress,
|
2021-02-03 21:03:46 +01:00
|
|
|
(uint64_t)dst_cnt * (uint64_t)num_sources);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
2011-02-20 10:51:16 +01:00
|
|
|
stop_progress(&progress);
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
|
2005-05-21 11:39:09 +02:00
|
|
|
/* cost matrix sorted by most to least similar pair */
|
2019-09-30 19:21:55 +02:00
|
|
|
STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
|
Optimize rename detection for a huge diff
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-30 05:54:56 +01:00
|
|
|
|
2011-02-19 05:10:32 +01:00
|
|
|
rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
|
2021-02-03 21:03:46 +01:00
|
|
|
if (want_copies)
|
2011-02-19 05:10:32 +01:00
|
|
|
rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
|
2005-05-21 11:39:09 +02:00
|
|
|
free(mx);
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "inexact renames", options->repo);
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-28 00:55:55 +02:00
|
|
|
cleanup:
|
2005-05-21 11:39:09 +02:00
|
|
|
/* At this point, we have found some renames and copies and they
|
2005-09-16 01:13:43 +02:00
|
|
|
* are recorded in rename_dst. The original list is still in *q.
|
2005-05-21 11:39:09 +02:00
|
|
|
*/
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_enter("diff", "write back to queue", options->repo);
|
2010-05-07 06:52:27 +02:00
|
|
|
DIFF_QUEUE_CLEAR(&outq);
|
2005-05-21 11:39:09 +02:00
|
|
|
for (i = 0; i < q->nr; i++) {
|
2005-05-24 10:10:48 +02:00
|
|
|
struct diff_filepair *p = q->queue[i];
|
|
|
|
struct diff_filepair *pair_to_free = NULL;
|
|
|
|
|
diffcore-rename: don't consider unmerged path as source
Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for
unmerged entries., 2007-01-05), an unmerged entry should be detected by
using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of
the filepair records mode=0 entries. However, it forgot to update some
parts of the rename detection logic.
This only makes difference in the "diff --cached" codepath where an
unmerged filepair carries information on the entries that came from the
tree. It probably hasn't been noticed for a long time because nobody
would run "diff -M" during a conflict resolution, but "git status" uses
rename detection when it internally runs "diff-index" and "diff-files"
and gives nonsense results.
In an unmerged pair, "one" side can have a valid filespec to record the
tree entry (e.g. what's in HEAD) when running "diff --cached". This can
be used as a rename source to other paths in the index that are not
unmerged. The path that is unmerged by definition does not have the
final content yet (i.e. "two" side cannot have a valid filespec), so it
can never be a rename destination.
Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and
allow the valid "one" side of an unmerged filepair to be considered a
potential rename source, but never to be considered a rename destination.
Commit message and first two test cases by Junio, the rest by Martin.
Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-03-24 03:41:01 +01:00
|
|
|
if (DIFF_PAIR_UNMERGED(p)) {
|
|
|
|
diff_q(&outq, p);
|
|
|
|
}
|
|
|
|
else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* Creation */
|
|
|
|
diff_q(&outq, p);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
2005-05-30 09:08:07 +02:00
|
|
|
else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {
|
|
|
|
/*
|
|
|
|
* Deletion
|
|
|
|
*
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
* We would output this delete record if:
|
|
|
|
*
|
|
|
|
* (1) this is a broken delete and the counterpart
|
|
|
|
* broken create remains in the output; or
|
2005-09-16 01:13:43 +02:00
|
|
|
* (2) this is not a broken delete, and rename_dst
|
|
|
|
* does not have a rename/copy to move p->one->path
|
|
|
|
* out of existence.
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
*
|
|
|
|
* Otherwise, the counterpart broken create
|
|
|
|
* has been turned into a rename-edit; or
|
|
|
|
* delete did not have a matching create to
|
|
|
|
* begin with.
|
2005-05-30 09:08:07 +02:00
|
|
|
*/
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
if (DIFF_PAIR_BROKEN(p)) {
|
|
|
|
/* broken delete */
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
struct diff_rename_dst *dst = locate_rename_dst(p);
|
|
|
|
if (!dst)
|
|
|
|
BUG("tracking failed somehow; failed to find associated dst for broken pair");
|
|
|
|
if (dst->is_rename)
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
/* counterpart is now rename/copy */
|
|
|
|
pair_to_free = p;
|
|
|
|
}
|
|
|
|
else {
|
2007-10-25 20:20:56 +02:00
|
|
|
if (p->one->rename_used)
|
[PATCH] Add -B flag to diff-* brothers.
A new diffcore transformation, diffcore-break.c, is introduced.
When the -B flag is given, a patch that represents a complete
rewrite is broken into a deletion followed by a creation. This
makes it easier to review such a complete rewrite patch.
The -B flag takes the same syntax as the -M and -C flags to
specify the minimum amount of non-source material the resulting
file needs to have to be considered a complete rewrite, and
defaults to 99% if not specified.
As the new test t4008-diff-break-rewrite.sh demonstrates, if a
file is a complete rewrite, it is broken into a delete/create
pair, which can further be subjected to the usual rename
detection if -M or -C is used. For example, if file0 gets
completely rewritten to make it as if it were rather based on
file1 which itself disappeared, the following happens:
The original change looks like this:
file0 --> file0' (quite different from file0)
file1 --> /dev/null
After diffcore-break runs, it would become this:
file0 --> /dev/null
/dev/null --> file0'
file1 --> /dev/null
Then diffcore-rename matches them up:
file1 --> file0'
The internal score values are finer grained now. Earlier
maximum of 10000 has been raised to 60000; there is no user
visible changes but there is no reason to waste available bits.
Signed-off-by: Junio C Hamano <junkio@cox.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-30 09:08:37 +02:00
|
|
|
/* this path remains */
|
|
|
|
pair_to_free = p;
|
|
|
|
}
|
2005-05-30 09:08:07 +02:00
|
|
|
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (!pair_to_free)
|
2005-05-30 09:08:07 +02:00
|
|
|
diff_q(&outq, p);
|
|
|
|
}
|
2005-05-24 10:10:48 +02:00
|
|
|
else if (!diff_unmodified_pair(p))
|
2005-05-28 00:55:55 +02:00
|
|
|
/* all the usual ones need to be kept */
|
2005-05-24 10:10:48 +02:00
|
|
|
diff_q(&outq, p);
|
2005-05-28 00:55:55 +02:00
|
|
|
else
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
/* no need to keep unmodified pairs; FIXME: remove earlier? */
|
2005-05-28 00:55:55 +02:00
|
|
|
pair_to_free = p;
|
|
|
|
|
2005-05-28 00:50:30 +02:00
|
|
|
if (pair_to_free)
|
|
|
|
diff_free_filepair(pair_to_free);
|
2005-05-21 11:39:09 +02:00
|
|
|
}
|
2005-05-24 10:10:48 +02:00
|
|
|
diff_debug_queue("done copying original", &outq);
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2005-05-24 10:10:48 +02:00
|
|
|
free(q->queue);
|
|
|
|
*q = outq;
|
|
|
|
diff_debug_queue("done collapsing", q);
|
2005-05-21 11:39:09 +02:00
|
|
|
|
2007-10-25 20:19:10 +02:00
|
|
|
for (i = 0; i < rename_dst_nr; i++)
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (rename_dst[i].filespec_to_free)
|
|
|
|
free_filespec(rename_dst[i].filespec_to_free);
|
2005-09-16 01:13:43 +02:00
|
|
|
|
2017-06-16 01:15:46 +02:00
|
|
|
FREE_AND_NULL(rename_dst);
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_dst_nr = rename_dst_alloc = 0;
|
2017-06-16 01:15:46 +02:00
|
|
|
FREE_AND_NULL(rename_src);
|
2005-05-24 10:10:48 +02:00
|
|
|
rename_src_nr = rename_src_alloc = 0;
|
diffcore-rename: accelerate rename_dst setup
register_rename_src() simply references the passed pair inside
rename_src. In contrast, add_rename_dst() did something entirely
different for rename_dst. Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL. Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs. This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary. Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src. Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.
Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go. However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on). Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index. Doing that allows us to still have a
fast lookup without sorting the rename_dst array. Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.
There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.
This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime. This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-11 10:08:47 +01:00
|
|
|
if (break_idx) {
|
|
|
|
strintmap_clear(break_idx);
|
|
|
|
FREE_AND_NULL(break_idx);
|
|
|
|
}
|
merge-ort: begin performance work; instrument with trace2_region_* calls
Add some timing instrumentation for both merge-ort and diffcore-rename;
I used these to measure and optimize performance in both, and several
future patch series will build on these to reduce the timings of some
select testcases.
=== Setup ===
The primary testcase I used involved rebasing a random topic in the
linux kernel (consisting of 35 patches) against an older version. I
added two variants, one where I rename a toplevel directory, and another
where I only rebase one patch instead of the whole topic. The setup is
as follows:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ git branch hwmon-updates fd8bdb23b91876ac1e624337bb88dc1dcc21d67e
$ git branch hwmon-just-one fd8bdb23b91876ac1e624337bb88dc1dcc21d67e~34
$ git branch base 4703d9119972bf586d2cca76ec6438f819ffa30e
$ git switch -c 5.4-renames v5.4
$ git mv drivers pilots # Introduce over 26,000 renames
$ git commit -m "Rename drivers/ to pilots/"
$ git config merge.renameLimit 30000
$ git config merge.directoryRenames true
=== Testcases ===
Now with REBASE standing for either "git rebase [--merge]" (using
merge-recursive) or "test-tool fast-rebase" (using merge-ort), the
testcases are:
Testcase #1: no-renames
$ git checkout v5.4^0
$ REBASE --onto HEAD base hwmon-updates
Note: technically the name is misleading; there are some renames, but
very few. Rename detection only takes about half the overall time.
Testcase #2: mega-renames
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-updates
Testcase #3: just-one-mega
$ git checkout 5.4-renames^0
$ REBASE --onto HEAD base hwmon-just-one
=== Timing results ===
Overall timings, using hyperfine (1 warmup run, 3 runs for mega-renames,
10 runs for the other two cases):
merge-recursive merge-ort
no-renames: 18.912 s ± 0.174 s 14.263 s ± 0.053 s
mega-renames: 5964.031 s ± 10.459 s 5504.231 s ± 5.150 s
just-one-mega: 149.583 s ± 0.751 s 158.534 s ± 0.498 s
A single re-run of each with some breakdowns:
--- no-renames ---
merge-recursive merge-ort
overall runtime: 19.302 s 14.257 s
inexact rename detection: 7.603 s 7.906 s
everything else: 11.699 s 6.351 s
--- mega-renames ---
merge-recursive merge-ort
overall runtime: 5950.195 s 5499.672 s
inexact rename detection: 5746.309 s 5487.120 s
everything else: 203.886 s 17.552 s
--- just-one-mega ---
merge-recursive merge-ort
overall runtime: 151.001 s 158.582 s
inexact rename detection: 143.448 s 157.835 s
everything else: 7.553 s 0.747 s
=== Timing observations ===
0) Maximum speedup
The "everything else" row represents the maximum speedup we could
achieve if we were to somehow infinitely parallelize inexact rename
detection, but leave everything else alone. The fact that this is so
much smaller than the real runtime (even in the case with virtually no
renames) makes it clear just how overwhelmingly large the time spent on
rename detection can be.
1) no-renames
1a) merge-ort is faster than merge-recursive, which is nice. However,
this still should not be considered good enough. Although the "merge"
backend to rebase (merge-recursive) is sometimes faster than the "apply"
backend, this is one of those cases where it is not. In fact, even
merge-ort is slower. The "apply" backend can complete this testcase in
6.940 s ± 0.485 s
which is about 2x faster than merge-ort and 3x faster than
merge-recursive. One goal of the merge-ort performance work will be to
make it faster than git-am on this (and similar) testcases.
2) mega-renames
2a) Obviously rename detection is a huge cost; it's where most the time
is spent. We need to cut that down. If we could somehow infinitely
parallelize it and drive its time to 0, the merge-recursive time would
drop to about 204s, and the merge-ort time would drop to about 17s. I
think this particular stat shows I've subtly baked a couple performance
improvements into merge-ort and into fast-rebase already.
3) just-one-mega
3a) not much to say here, it just gives some flavor for how rebasing
only one patch compares to rebasing 35.
=== Goals ===
This patch is obviously just the beginning. Here are some of my goals
that this measurement will help us achieve:
* Drive the cost of rename detection down considerably for merges
* After the above has been achieved, see if there are other slowness
factors (which would have previously been overshadowed by rename
detection costs) which we can then focus on and also optimize.
* Ensure our rebase testcase that requires little rename detection
is noticeably faster with merge-ort than with apply-based rebase.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Taylor Blau <ttaylorr@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-24 07:01:12 +01:00
|
|
|
trace2_region_leave("diff", "write back to queue", options->repo);
|
2005-05-21 11:39:09 +02:00
|
|
|
return;
|
|
|
|
}
|