1. 03 Apr, 2013 1 commit
    • diffcore-break: don't divide by zero · e7b00c57
      John Keeping authored
      When the source file is empty, the calculation of the merge score
      results in a division by zero.  In the situation:
           == preimage ==             == postimage ==
           F (empty file)             F (a large file)
                                      E (a new empty file)
      it does not make sense to consider F->E as a rename, so it is better not
      to break the pre- and post-image of F.
      Bail out early in this case to avoid hitting the divide-by-zero.  This
      causes the merge score to be left at zero.
      Signed-off-by: John Keeping <john@keeping.me.uk>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
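      The early bail-out described in this commit can be sketched roughly as
      follows. This is an illustration only, not the code from
      diffcore-break.c: the helper name merge_score_sketch, its parameters,
      and the MAX_SCORE value of 60000 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SCORE 60000UL /* assumed scale, for illustration only */

/*
 * Hypothetical helper: the share of the source that was removed,
 * scaled to MAX_SCORE. src_size is the divisor, so an empty source
 * has to be handled before the division is reached.
 */
static unsigned long merge_score_sketch(size_t src_size, size_t src_removed)
{
	assert(src_removed <= src_size);
	if (!src_size)
		return 0; /* bail out early; the score is left at zero */
	return (unsigned long)(src_removed * MAX_SCORE / src_size);
}
```

      With an empty source the function returns 0 without dividing; for a
      100-byte source that lost 97 bytes it returns 58200.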
  2. 07 May, 2010 1 commit
  3. 16 Nov, 2009 2 commits
    • diffcore-break: save cnt_data for other phases · 8282de94
      Jeff King authored
      The "break" phase works by counting changes between two
      blobs with the same path. We do this by splitting the file
      into chunks (or lines for text oriented files) and then
      keeping a count of chunk hashes.
      The "rename" phase counts changes between blobs at two
      different paths. However, it uses the exact same set of
      chunk hashes (which are immutable for a given sha1).
      The rename phase can therefore use the same hash data as
      break. Unfortunately, we were throwing this data away after
      computing it in the break phase. This patch instead attaches
      it to the filespec and lets it live through the rename
      phase, working under the assumption that most of the time
      that breaks are being computed, renames will be too.
      We only do this optimization for files which have actually
      been broken, as those ones will be candidates for rename
      detection (and it is a time-space tradeoff, so we don't want
      to waste space keeping useless data).
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
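      The caching idea in this commit can be sketched as below, under heavy
      simplification. The struct spec_sketch, its fields, and
      get_line_hashes() are hypothetical stand-ins; the real code keeps counts
      of chunk hashes on the filespec, whereas this sketch only stores one
      hash per line to show the "compute once, reuse in a later phase"
      pattern.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a filespec with cached hash data. */
struct spec_sketch {
	const char *data;
	size_t size;
	unsigned *cnt_data;   /* cached per-line hashes, NULL until computed */
	size_t cnt_nr;        /* number of entries in cnt_data */
	int times_computed;   /* bookkeeping for the demonstration only */
};

static unsigned hash_line(const char *p, size_t len)
{
	unsigned h = 5381;
	while (len--)
		h = h * 33 + (unsigned char)*p++;
	return h;
}

/* Return the line hashes, computing them only on the first call. */
static const unsigned *get_line_hashes(struct spec_sketch *s)
{
	size_t i, start = 0, n = 0;

	assert(s != NULL);
	if (s->cnt_data)
		return s->cnt_data; /* a later phase reuses the earlier work */

	/* Count lines; a trailing line without a newline counts too. */
	for (i = 0; i < s->size; i++)
		if (s->data[i] == '\n')
			n++;
	if (s->size && s->data[s->size - 1] != '\n')
		n++;

	s->cnt_data = malloc((n ? n : 1) * sizeof(*s->cnt_data));
	if (!s->cnt_data)
		return NULL;
	s->cnt_nr = 0;
	for (i = 0; i < s->size; i++) {
		if (s->data[i] == '\n') {
			s->cnt_data[s->cnt_nr++] = hash_line(s->data + start, i - start);
			start = i + 1;
		}
	}
	if (start < s->size)
		s->cnt_data[s->cnt_nr++] = hash_line(s->data + start, s->size - start);
	s->times_computed++;
	return s->cnt_data;
}
```

      The caller owns cnt_data and frees it once the last phase is done, which
      is the time-space tradeoff the commit message mentions.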
    • diffcore-break: free filespec data as we go · f4f19fb6
      Jeff King authored
      As we look at each changed file and consider breaking it, we
      load the blob data and make a decision about whether to
      break, which is independent of any other blobs that might
      have changed. However, we keep the data in memory while we
      consider breaking all of the other files. Which means that
      both versions of every file you are diffing are in memory at
      the same time.
      This patch instead frees the blob data as we finish with
      each file pair, leading to much lower memory usage.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  4. 08 Mar, 2009 1 commit
  5. 02 Dec, 2007 1 commit
    • rename: Break filepairs with different types. · b45563a2
      Junio C Hamano authored
      When we consider if a path has been totally rewritten, we did not
      touch changes from symlinks to files or vice versa.  But a change
      that modifies even the type of a blob surely should count as a
      complete rewrite.
      While we are at it, modernise diffcore-break to be aware of gitlinks (we
      do not want to touch them).
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  6. 21 Oct, 2007 1 commit
    • Fix diffcore-break total breakage · 6dd4b66f
      Linus Torvalds authored
      Ok, so on the kernel list, some people noticed that "git log --follow"
      doesn't work too well with some files in the x86 merge, because a lot of
      files got renamed in very special ways.
      In particular, there was a pattern of doing single commits with renames
      that looked basically like
       - rename "filename.h" -> "filename_64.h"
       - create new "filename.c" that includes "filename_32.h" or
         "filename_64.h" depending on whether we're 32-bit or 64-bit.
      which was preparatory for smushing the two trees together.
      Now, there's two issues here:
       - "filename.c" *remained*. Yes, it was a rename, but there was a new file
         created with the old name in the same commit. This was important,
         because we wanted each commit to compile properly, so that it was
         bisectable, so splitting the rename into one commit and the "create
         helper file" into another was *not* an option.
         So we need to break associations where the contents change too much.
         Fine. We have the -B flag for that. When we break things up, then the
         rename detection will be able to figure out whether there are better
       - "git log --follow" didn't work with -B.
      Now, the second case was really simple: we use a different "diffopt"
      structure for the rename detection than the basic one (which we use for
      showing the diffs). So that second case is trivially fixed by a trivial
      one-liner that just copies the break_opt values from the "real" diffopts
      to the one used for rename following. So now "git log -B --follow" works
      	diff --git a/tree-diff.c b/tree-diff.c
      	index 26bdbdd..7c261fd 100644
      	--- a/tree-diff.c
      	+++ b/tree-diff.c
      	@@ -319,6 +319,7 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
      	 	diff_opts.detect_rename = DIFF_DETECT_RENAME;
      	 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
      	 	diff_opts.single_follow = opt->paths[0];
      	+	diff_opts.break_opt = opt->break_opt;
      	 	paths[0] = NULL;
      	 	diff_tree_setup_paths(paths, &diff_opts);
      	 	if (diff_setup_done(&diff_opts) < 0)
      however, the end result does *not* work. Because our diffcore-break.c
      logic is totally bogus!
      In particular:
       - it used to do
      	if (base_size < MINIMUM_BREAK_SIZE)
      		return 0; /* we do not break too small filepair */
         which basically says "don't bother to break small files". But that
         "base_size" is the *smaller* of the two sizes, which means that if some
         large file was rewritten into one that just includes another file, we
         would look at the (small) result, and decide that it's smaller than the
         break size, so it cannot be worth it to break it up! Even if the other
         side was ten times bigger and looked *nothing* like the small file!
         That's clearly bogus. I replaced "base_size" with "max_size", so that
         we compare the *bigger* of the filepair with the break size.
       - It calculated a "merge_score", which was the score needed to merge it
         back together if nothing else wanted it. But even if it was *so*
         different that we would never want to merge it back, we wouldn't
         consider it a break! That makes no sense. So I added
      	if (*merge_score_p > break_score)
      		return 1;
         to make it clear that if we wouldn't want to merge it at the end, it
         was *definitely* a break.
       - It compared the whole "extent of damage", counting all inserts and
         deletes, but it based this score on the "base_size", and generated the
         damage score with
      	delta_size = src_removed + literal_added;
      	damage_score = delta_size * MAX_SCORE / base_size;
         but that makes no sense either, since quite often, this will result in
         a number that is *bigger* than MAX_SCORE! Why? Because base_size is
         (again) the smaller of the two files we compare, and when you start out
         from a small file and add a lot (or start out from a large file and
         remove a lot), the base_size is going to be much smaller than the
         Again, the fix was to replace "base_size" with "max_size", at which
         point the damage actually becomes a sane percentage of the whole.
      With these changes in place, not only does "git log -B --follow" work for
      the case that triggered this in the first place, ie now
      	git log -B --follow arch/x86/kernel/vmlinux_64.lds.S
      actually gives reasonable results. But I also wanted to verify it in
      general, by doing a full-history
      	git log --stat -B -C
      on my kernel tree with the old code and the new code.
      There's some tweaking to be done, but generally, the new code generates
      much better results wrt breaking up files (and then finding better rename
      candidates). Here's a few examples of the "--stat" output:
       - This:
      	include/asm-x86/Kbuild        |    2 -
      	include/asm-x86/debugreg.h    |   79 +++++++++++++++++++++++++++++++++++------
      	include/asm-x86/debugreg_32.h |   64 ---------------------------------
      	include/asm-x86/debugreg_64.h |   65 ---------------------------------
      	4 files changed, 68 insertions(+), 142 deletions(-)
         Becomes:
      	include/asm-x86/Kbuild                        |    2 -
      	include/asm-x86/{debugreg_64.h => debugreg.h} |    9 +++-
      	include/asm-x86/debugreg_32.h                 |   64 -------------------------
      	3 files changed, 7 insertions(+), 68 deletions(-)
       - This:
      	include/asm-x86/bug.h    |   41 +++++++++++++++++++++++++++++++++++++++--
      	include/asm-x86/bug_32.h |   37 -------------------------------------
      	include/asm-x86/bug_64.h |   34 ----------------------------------
      	3 files changed, 39 insertions(+), 73 deletions(-)
         Becomes:
      	include/asm-x86/{bug_64.h => bug.h} |   20 +++++++++++++-----
      	include/asm-x86/bug_32.h            |   37 -----------------------------------
      	2 files changed, 14 insertions(+), 43 deletions(-)
      Now, in some other cases, it does actually turn a rename into a real
      "delete+create" pair, and then the diff is usually bigger, so truth in
      advertizing: it doesn't always generate a nicer diff. But for what -B was
      meant for, I think this is a big improvement, and I suspect those cases
      where it generates a bigger diff are tweakable.
      So I think this diff fixes a real bug, but we might still want to tweak
      the default values and perhaps the exact rules for when a break happens.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
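      Putting the three fixes from this commit together gives roughly the
      following decision function. It is a simplified sketch, not the actual
      should_break() from diffcore-break.c: the function name, its signature,
      the src_copied parameter, and the MINIMUM_BREAK_SIZE and MAX_SCORE
      values are assumptions for illustration. Only the max_size comparison,
      the merge-score shortcut, and the damage-score formula come from the
      message above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SCORE 60000UL        /* assumed scale */
#define MINIMUM_BREAK_SIZE 400   /* assumed threshold, in bytes */

/*
 * Hypothetical sketch: returns 1 if the filepair should be broken.
 * src_copied is how much of the source survives in the destination,
 * literal_added is how much new material the destination contains.
 */
static int should_break_sketch(size_t src_size, size_t dst_size,
			       size_t src_copied, size_t literal_added,
			       unsigned long break_score,
			       unsigned long *merge_score_p)
{
	size_t max_size = src_size > dst_size ? src_size : dst_size;
	size_t src_removed, delta_size;
	unsigned long damage_score;

	assert(merge_score_p != NULL);
	*merge_score_p = 0;

	/* Compare the *bigger* side with the minimum break size. */
	if (max_size < MINIMUM_BREAK_SIZE)
		return 0;
	if (!src_size)
		return 0; /* nothing to score against */

	if (src_copied > src_size)
		src_copied = src_size;
	src_removed = src_size - src_copied;
	*merge_score_p = (unsigned long)(src_removed * MAX_SCORE / src_size);

	/* If we would never merge it back, it is definitely a break. */
	if (*merge_score_p > break_score)
		return 1;

	/* Extent of damage, measured against the bigger side. */
	delta_size = src_removed + literal_added;
	damage_score = (unsigned long)(delta_size * MAX_SCORE / max_size);
	return damage_score >= break_score;
}
```

      For a 4000-byte file rewritten into a 40-byte stub with nothing in
      common, the sketch reports a break. A check based on the smaller size
      would have skipped the pair for being under the minimum break size.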
  7. 01 Jul, 2007 1 commit
    • diffcore_count_changes: pass diffcore_filespec · d8c3d03a
      Junio C Hamano authored
      We may want to use richer information on the data we are dealing
      with in this function, so instead of passing a buffer address
      and length, just pass the diffcore_filespec structure.  Existing
      callers always call this function with parameters taken from a
      filespec anyway, so there are no functional changes.
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  8. 07 Mar, 2007 1 commit
    • Cast 64 bit off_t to 32 bit size_t · dc49cd76
      Shawn O. Pearce authored
      Some systems have sizeof(off_t) == 8 while sizeof(size_t) == 4.
      This implies that we are able to access and work on files whose
      maximum length is around 2^63-1 bytes, but we can only malloc or
      mmap somewhat less than 2^32-1 bytes of memory.
      On such a system an implicit conversion of off_t to size_t can cause
      the size_t to wrap, resulting in unexpected and exciting behavior.
      Right now we are working around all gcc warnings generated by the
      -Wshorten-64-to-32 option by passing the off_t through xsize_t().
      In the future we should make xsize_t on such problematic platforms
      detect the wrapping and die if such a file is accessed.
      Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
      Signed-off-by: Junio C Hamano <junkio@cox.net>
  9. 17 Aug, 2006 1 commit
  10. 12 Mar, 2006 1 commit
  11. 04 Mar, 2006 1 commit
  12. 02 Mar, 2006 1 commit
    • diff-delta: allow reusing of the reference buffer index · 38fd0721
      Nicolas Pitre authored
      When a reference buffer is used multiple times then its index can be
      computed only once and reused multiple times.  This patch adds an extra
      pointer to a pointer argument (from_index) to diff_delta() for this.
      If from_index is NULL then everything is like before.
      If from_index is non NULL and *from_index is NULL then the index is
      created and its location stored to *from_index.  In this case the caller
      has the responsibility to free the memory pointed to by *from_index.
      If from_index and *from_index are non NULL then the index is reused as
      This currently saves about 10% of CPU time to repack the git archive.
      Signed-off-by: Nicolas Pitre <nico@cam.org>
      Signed-off-by: Junio C Hamano <junkio@cox.net>
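      The three cases for from_index follow a common pointer-to-pointer
      caching pattern, sketched below. Everything except the from_index
      argument name is hypothetical: struct index_sketch, build_index() and
      use_reference_sketch() stand in for the real index and for
      diff_delta(), whose actual signature is not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the reference buffer index. */
struct index_sketch {
	size_t len;
	unsigned long sum;
};

static int builds; /* counts index constructions, for demonstration */

static struct index_sketch *build_index(const unsigned char *buf, size_t len)
{
	struct index_sketch *idx = malloc(sizeof(*idx));
	size_t i;

	if (!idx)
		return NULL;
	idx->len = len;
	idx->sum = 0;
	for (i = 0; i < len; i++)
		idx->sum += buf[i];
	builds++;
	return idx;
}

/* Returns 0 on success, -1 on allocation failure. */
static int use_reference_sketch(const unsigned char *ref, size_t len,
				struct index_sketch **from_index)
{
	struct index_sketch *idx;

	if (from_index && *from_index) {
		idx = *from_index;         /* both non-NULL: reuse the index */
	} else {
		idx = build_index(ref, len);
		if (!idx)
			return -1;
		if (from_index)
			*from_index = idx; /* caller now owns and frees it */
	}

	/* A real delta routine would consult idx here. */
	assert(idx->len == len);

	if (!from_index)
		free(idx);                 /* NULL argument: behave as before */
	return 0;
}
```

      A caller that deltas many targets against one reference passes the same
      pointer each time, so the index is built once, and frees it after the
      last call.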
  13. 01 Mar, 2006 3 commits
  14. 13 Dec, 2005 1 commit
  15. 12 Dec, 2005 1 commit
  16. 14 Sep, 2005 1 commit
  17. 14 Aug, 2005 1 commit
    • [PATCH] plug memory leak in diff.c::diff_free_filepair() · 068eac91
      Yasushi SHOJI authored
      When I run git-diff-tree on a big change, the command seems to eat a lot
      of memory, so I put git under valgrind to see what's going on.
      diff_free_filespec_data() doesn't free diff_filespec itself.
      [jc: I ended up doing things slightly differently from Yasushi's
      patch.  The original idea was to use free_filespec_data() only to
      free the data portion and keep using the filespec itself, but
      no existing code seems to do things that way, so I just yanked
      that part out.]
      Signed-off-by: Yasushi SHOJI <yashi@atmark-techno.com>
      Signed-off-by: Junio C Hamano <junkio@cox.net>
  18. 29 Jun, 2005 1 commit
  19. 26 Jun, 2005 1 commit
    • Add a "max_size" parameter to diff_delta() · 75c42d8c
      Linus Torvalds authored
      Anything that generates a delta to see if two objects are close usually
      isn't interested in the delta if it ends up being bigger than some specified
      size, and this allows us to stop delta generation early when that
  20. 20 Jun, 2005 1 commit
    • [PATCH] Rework -B output. · 366175ef
      Junio C Hamano authored
      Patch for a completely rewritten file detected by the -B flag
      was shown as a pair of creation followed by deletion in earlier
      versions.  This was a misguided attempt to make reviewing such
      a complete rewrite easier, and unnecessarily ended up confusing
      git-apply.  Instead, show the entire contents of old version
      prefixed with '-', followed by the entire contents of new
      version prefixed with '+'.  This gives the same ease of review
      for human consumers while keeping it a single, regular
      modification patch for machine consumption, something that even
      GNU patch can grok.
      Signed-off-by: Junio C Hamano <junkio@cox.net>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  21. 05 Jun, 2005 1 commit
    • [PATCH] diffcore-break.c: various fixes. · f78c79c5
      Junio C Hamano authored
      This fixes three bugs in the -B heuristics.
       - Although it was advertised that the initial break criteria
         used was the same as what diffcore-rename uses, it was using
         something different.  Instead of using smaller of src and dst
         size to compare with "edit" size, (insertion and deletion),
         it was using larger of src and dst, unlike the rename/copy
         detection logic.  This caused the parameter to -B to mean
         something different from the one to -M and -C.  To compensate
         for this change, the default break score is also changed to
         match that of the default for rename/copy.
       - The code would have crashed with division by zero when trying
         to break an originally empty file.
       - Contrary to what the comment said, the algorithm was breaking
         small files, only to later merge them together.
      Signed-off-by: Junio C Hamano <junkio@cox.net>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  22. 03 Jun, 2005 2 commits
    • [PATCH] diff: Update -B heuristics. · eeaa4603
      Junio C Hamano authored
      As Linus pointed out on the mailing list discussion, -B should
      break a file that has many inserts even if it still keeps
      enough of the original contents, so that the broken pieces can
      later be matched with other files by -M or -C.  However, if such
      a broken pair does not get picked up by -M or -C, we would want
      to apply different criteria; namely, regardless of the amount of
      new material in the result, the determination of "rewrite"
      should be done by looking at the amount of original material
      still left in the result.  If you still have the original 97
      lines from a 100-line document, it does not matter if you add
      your own 13 lines to make a 110-line document, or if you add 903
      lines to make a 1000-line document.  It is not a rewrite but an
      in-place edit.  On the other hand, if you did lose 97 lines from
      the original, it does not matter if you added 27 lines to make a
      30-line document or if you added 997 lines to make a 1000-line
      document.  You did a complete rewrite in either case.
      This patch introduces a post-processing phase that runs after
      diffcore-rename matches up broken pairs diffcore-break creates.
      The purpose of this post-processing is to pick up these broken
      pieces and merge them back into in-place modifications.  For
      this, the score parameter -B option takes is changed into a pair
      of numbers, and it takes "-B99/80" format when fully spelled
      out.  The first number is the minimum amount of "edit" (same
      definition as what diffcore-rename uses, which is "sum of
      deletion and insertion") that a modification needs to have to be
      broken, and the second number is the minimum amount of "delete"
      a surviving broken pair must have to avoid being merged back
      together.  It can be abbreviated to "-B" to use default for
      both, "-B9" or "-B9/" to use 90% for "edit" but default (80%)
      for merge avoidance, or "-B/75" to use default (99%) "edit" and
      75% for merge avoidance.
      Signed-off-by: Junio C Hamano <junkio@cox.net>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
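      The "-B<edit>/<delete>" syntax described above can be parsed along the
      following lines. This is a sketch under assumptions, not git's option
      parser: parse_break_opt_sketch() and parse_fraction() are hypothetical
      names, digits are read as a decimal fraction (so "9" means 90% and "75"
      means 75%) to match the examples in the message, and scores are scaled
      to an assumed MAX_SCORE of 60000 with the 99% and 80% defaults quoted
      above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SCORE 60000                             /* assumed scale */
#define DEFAULT_BREAK_SCORE (MAX_SCORE * 99 / 100)  /* 99% "edit" */
#define DEFAULT_MERGE_SCORE (MAX_SCORE * 80 / 100)  /* 80% "delete" */

/*
 * Read a run of digits as a decimal fraction scaled to MAX_SCORE.
 * Advances *p past the digits; returns -1 if there were none.
 */
static int parse_fraction(const char **p)
{
	unsigned long long num = 0, scale = 1;
	const char *s = *p;

	while (*s >= '0' && *s <= '9') {
		if (scale < 100000) { /* ignore excess precision */
			num = num * 10 + (unsigned long long)(*s - '0');
			scale *= 10;
		}
		s++;
	}
	if (s == *p)
		return -1;
	*p = s;
	return (int)(num * MAX_SCORE / scale);
}

/* Returns 0 on success, -1 if arg is not a well-formed -B option. */
static int parse_break_opt_sketch(const char *arg,
				  int *break_score, int *merge_score)
{
	const char *p = arg;
	int v;

	assert(arg && break_score && merge_score);
	*break_score = DEFAULT_BREAK_SCORE;
	*merge_score = DEFAULT_MERGE_SCORE;

	if (p[0] != '-' || p[1] != 'B')
		return -1;
	p += 2;
	if ((v = parse_fraction(&p)) >= 0)
		*break_score = v;      /* minimum "edit" needed to break */
	if (*p == '/') {
		p++;
		if ((v = parse_fraction(&p)) >= 0)
			*merge_score = v;  /* minimum "delete" to stay broken */
	}
	return *p ? -1 : 0;
}
```

      Under these assumptions "-B9/" yields 54000 for the break score with the
      default merge score, and "-B/75" keeps the default break score while
      setting the merge score to 45000.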
    • [PATCH] Tweak count-delta interface · 355e76a4
      Junio C Hamano authored
      Make it return copied source and insertion separately, so that
      later implementation of heuristics can use them more flexibly.
      This does not change the heuristics implemented in
      diffcore-rename nor diffcore-break in any way.
      Signed-off-by: Junio C Hamano <junkio@cox.net>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  23. 30 May, 2005 1 commit
    • [PATCH] Add -B flag to diff-* brothers. · f345b0a0
      Junio C Hamano authored
      A new diffcore transformation, diffcore-break.c, is introduced.
      When the -B flag is given, a patch that represents a complete
      rewrite is broken into a deletion followed by a creation.  This
      makes it easier to review such a complete rewrite patch.
      The -B flag takes the same syntax as the -M and -C flags to
      specify the minimum amount of non-source material the resulting
      file needs to have to be considered a complete rewrite, and
      defaults to 99% if not specified.
      As the new test t4008-diff-break-rewrite.sh demonstrates, if a
      file is a complete rewrite, it is broken into a delete/create
      pair, which can further be subjected to the usual rename
      detection if -M or -C is used.  For example, if file0 gets
      completely rewritten to make it as if it were rather based on
      file1 which itself disappeared, the following happens:
          The original change looks like this:
      	file0     --> file0' (quite different from file0)
      	file1     --> /dev/null
          After diffcore-break runs, it would become this:
      	file0     --> /dev/null
      	/dev/null --> file0'
      	file1     --> /dev/null
          Then diffcore-rename matches them up:
      	file1     --> file0'
      The internal score values are finer grained now.  Earlier
      maximum of 10000 has been raised to 60000; there are no
      user-visible changes but there is no reason to waste available bits.
      Signed-off-by: Junio C Hamano <junkio@cox.net>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
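      The file0/file1 walk-through above can be replayed on a toy
      representation of filepairs. The sketch below is illustrative only:
      struct pair_sketch and both functions are hypothetical, a NULL path
      stands for /dev/null, and the caller tells the rename step which
      deletion and creation to join, whereas the real diffcore-rename picks
      them by similarity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filepair; a NULL path stands for /dev/null. */
struct pair_sketch {
	const char *src;
	const char *dst;
};

/*
 * Break the modification at index i into a deletion (kept at i) and a
 * creation (appended at the end). Returns the new number of pairs.
 */
static size_t break_pair_sketch(struct pair_sketch *pairs, size_t nr,
				size_t cap, size_t i)
{
	assert(i < nr && nr < cap);
	assert(pairs[i].src && pairs[i].dst);
	pairs[nr].src = NULL;
	pairs[nr].dst = pairs[i].dst; /* /dev/null --> file0' */
	pairs[i].dst = NULL;          /* file0     --> /dev/null */
	return nr + 1;
}

/*
 * Join the deletion at index del with the creation at index cre into a
 * single rename pair. Returns the new number of pairs.
 */
static size_t match_rename_sketch(struct pair_sketch *pairs, size_t nr,
				  size_t del, size_t cre)
{
	assert(del < nr && cre < nr && del != cre);
	assert(pairs[del].src && !pairs[del].dst);
	assert(!pairs[cre].src && pairs[cre].dst);
	pairs[del].dst = pairs[cre].dst; /* file1 --> file0' */
	pairs[cre] = pairs[nr - 1];      /* drop the consumed creation */
	return nr - 1;
}
```

      Starting from the two pairs file0 --> file0' and file1 --> /dev/null,
      breaking the first and then joining file1's deletion with the new
      creation leaves file1 --> file0'. In this sketch the deletion of file0
      stays in the list as its own pair.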