Commit Graph

177 Commits

Author SHA1 Message Date
Steffen Prohaska
21e5ad50fc safecrlf: Add mechanism to warn about irreversible crlf conversions
CRLF conversion bears a slight chance of corrupting data.
autocrlf=true will convert CRLF to LF during commit and LF to
CRLF during checkout.  A file that contains a mixture of LF and
CRLF before the commit cannot be recreated by git.  For text
files this is the right thing to do: it corrects line endings
such that we have only LF line endings in the repository.
But for binary files that are accidentally classified as text the
conversion can corrupt data.

If you recognize such corruption early you can easily fix it by
setting the conversion type explicitly in .gitattributes.  Right
after committing you still have the original file in your work
tree and this file is not yet corrupted.  You can explicitly tell
git that this file is binary and git will handle the file
appropriately.

Unfortunately, the desired effect of cleaning up text files with
mixed line endings and the undesired effect of corrupting binary
files cannot be distinguished.  In both cases CRLFs are removed
in an irreversible way.  For text files this is the right thing
to do because CRLFs are line endings, while for binary files
converting CRLFs corrupts data.

This patch adds a mechanism that can either warn the user about
an irreversible conversion or can even refuse to convert.  The
mechanism is controlled by the variable core.safecrlf, with the
following values:

 - false: disable safecrlf mechanism
 - warn: warn about irreversible conversions
 - true: refuse irreversible conversions

The default is to warn.  Users are only affected by this default
if core.autocrlf is set.  But the current default of git is to
leave core.autocrlf unset, so users will not see warnings unless
they deliberately chose to activate the autocrlf mechanism.
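
What counts as an "irreversible" conversion can be sketched in C as a
commit-then-checkout round trip that fails to reproduce the work tree
file.  This is only an illustration under stated assumptions, not the
code this patch adds: the names eol_mode, eol_stats, count_eols() and
roundtrip_is_lossy() are hypothetical, and the handling of the
autocrlf=input case is an assumption that the text above does not spell
out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the core.autocrlf setting. */
enum eol_mode { EOL_FALSE, EOL_INPUT, EOL_TRUE };

/* Hypothetical line-ending statistics for one buffer. */
struct eol_stats {
    size_t crlf;    /* number of "\r\n" pairs */
    size_t lone_lf; /* number of "\n" not preceded by "\r" */
};

static struct eol_stats count_eols(const char *buf, size_t len)
{
    struct eol_stats st = { 0, 0 };
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] != '\n')
            continue;
        if (i > 0 && buf[i - 1] == '\r')
            st.crlf++;
        else
            st.lone_lf++;
    }
    return st;
}

/*
 * Returns 1 if committing the buffer and checking it out again would
 * produce different bytes, 0 otherwise.
 */
static int roundtrip_is_lossy(const char *buf, size_t len, enum eol_mode mode)
{
    struct eol_stats st = count_eols(buf, len);

    switch (mode) {
    case EOL_TRUE:
        /* Checkout writes CRLF everywhere, so any naked LF is lost. */
        return st.lone_lf > 0;
    case EOL_INPUT:
        /* Assumption: CRLF is stripped on commit and never restored. */
        return st.crlf > 0;
    default:
        return 0;
    }
}
```

With such a predicate, "warn" would print a message when it returns 1
and "true" would refuse the operation.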

The safecrlf mechanism's details depend on the git command.  The
general principles when safecrlf is active (not false) are:

 - we warn/error out if files in the work tree can be modified in an
   irreversible way without giving the user a chance to back up the
   original file.

 - for read-only operations that do not modify files in the work tree
   we do not print annoying warnings.

There are exceptions.  Even though...

 - "git add" itself does not touch the files in the work tree, the
   next checkout would, so the safety triggers;

 - "git apply" to update a text file with a patch does touch the files
   in the work tree, but the operation is about text files and CRLF
   conversion is about fixing the line ending inconsistencies, so the
   safety does not trigger;

 - "git diff" itself does not touch the files in the work tree, it is
   often run to inspect the changes you intend to next "git add".  To
   catch potential problems early, safety triggers.

The concept of a safety check was originally proposed in a similar
way by Linus Torvalds.  Thanks to Dmitry Potapov for insisting
on getting the naked LF/autocrlf=true case right.

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
2008-02-06 13:07:28 -08:00
Dmitry Potapov
28624193b2 treat any file with NUL as binary
There are two heuristics in Git to detect whether a file is binary
or text. One in xdiff-interface.c (which is taken from GNU diff)
relies on existence of the NUL byte at the beginning. However,
convert.c used a different heuristic, which relied on the percent
of non-printable symbols (less than 1% for text files).

Due to differences in detection whether a file is binary or not,
it was possible that a file that diff treats as binary could be
treated as text by CRLF conversion. This is very confusing for a
user who sees that 'git diff' shows the file as binary and expects it
to be added as binary.

This patch makes is_binary consider any file that contains at
least one NUL character as binary, to ensure that the heuristic
used for CRLF conversion is tighter than what is used by diff.
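
A combined check along these lines might look like the sketch below.
It is an illustration only, not the actual is_binary() from convert.c:
the function name looks_binary(), the set of bytes counted as
non-printable, and the exact 1% comparison are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the buffer should be treated as binary, 0 if it looks
 * like text.  Any NUL byte makes the buffer binary; otherwise it is
 * binary when at least 1% of its bytes are non-printable.
 */
static int looks_binary(const unsigned char *buf, size_t len)
{
    size_t i, nonprintable = 0;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0;

    /* A single NUL anywhere settles the question. */
    if (memchr(buf, '\0', len))
        return 1;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        /* Assumption: common whitespace controls count as text. */
        if (c == '\n' || c == '\r' || c == '\t')
            continue;
        if (c < 32 || c == 127)
            nonprintable++;
    }
    return nonprintable * 100 >= len;
}
```

Checking for NUL first keeps the result in line with a diff-style
heuristic for any file that has a NUL in the range diff inspects.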

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-16 09:10:34 -08:00
Johannes Sixt
546bb58232 Use the asynchronous function infrastructure to run the content filter.
Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-10-21 01:30:42 -04:00
Johannes Sixt
7683b6e81f Avoid a dup2(2) in apply_filter() - start_command() can do it for us.
When apply_filter() runs the external (clean or smudge) filter program, it
needs to pass the writable end of a pipe as its stdout. For this purpose,
it used to dup2(2) the file descriptor explicitly to stdout. Now we use
the facilities of start_command() to do it for us.

Furthermore, the path argument of a subordinate function, filter_buffer(),
was not used, so here we replace it to pass the fd instead.

Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-10-21 01:30:42 -04:00
Johannes Sixt
dc1bfdcd1a Use start_command() to run content filters instead of explicit fork/exec.
The previous code already used finish_command() to wait for the process
to terminate, but did not use start_command() to run it.

Signed-off-by: Johannes Sixt <johannes.sixt@telecom.at>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-10-21 01:30:39 -04:00
Pierre Habouzit
90d16ec032 Fix in-place editing functions in convert.c
* crlf_to_git and ident_to_git:

  Don't grow the buffer if there is enough space in the first place.
  As a side effect, when the editing is done "in place", we don't grow, so
  the buffer pointer doesn't change, and `src' isn't invalidated anymore.

  Thanks to Bernt Hansen for the bug report.

* apply_filter:

  Fix memory leak due to fake in-place editing that didn't collect the
  old buffer when the filter succeeds. Also a cosmetic fix.

Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Lars Hjemli <hjemli@gmail.com>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-10-15 21:38:09 -04:00
Pierre Habouzit
b315c5c081 strbuf change: be sure ->buf is never ever NULL.
For that purpose, the ->buf is always initialized with a char * buf living
in the strbuf module. It is made a char * so that we can sloppily accept
things that perform: sb->buf[0] = '\0', and because you can't pass "" as an
initializer for ->buf without making gcc unhappy for very good reasons.

strbuf_init/_detach/_grow have been fixed to trust ->alloc and not ->buf
anymore.

As a consequence strbuf_detach is _mandatory_ to detach a buffer, copying
->buf isn't an option anymore, if ->buf is going to escape from the scope,
and eventually be free'd.

API changes:
  * strbuf_setlen now always works, so just make strbuf_reset a convenience
    macro.
  * strbuf_detach takes a size_t* optional argument (meaning it can be
    NULL) to copy the buffer's len, as it was needed for this refactor to
    make the code more readable, and working like the callers.
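
The idea can be sketched with a miniature buffer type.  This is an
illustration under assumptions, not the strbuf module itself: the
sbuf_* names and sbuf_slop are hypothetical, and error handling is
reduced to abort().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sbuf {
    size_t alloc; /* bytes owned on the heap, 0 if none */
    size_t len;   /* bytes in use, not counting the NUL */
    char *buf;    /* never NULL */
};

/* Shared one-byte buffer that every empty sbuf points at. */
static char sbuf_slop[1];

static void sbuf_init(struct sbuf *sb)
{
    sb->alloc = 0;
    sb->len = 0;
    sb->buf = sbuf_slop;
}

static void sbuf_grow(struct sbuf *sb, size_t extra)
{
    size_t need = sb->len + extra + 1;
    char *p;

    if (need <= sb->alloc)
        return;
    /* Trust alloc, not buf: the shared buffer must never be realloc'ed. */
    p = realloc(sb->alloc ? sb->buf : NULL, need);
    if (!p)
        abort();
    if (!sb->alloc)
        p[0] = '\0';
    sb->buf = p;
    sb->alloc = need;
}

static void sbuf_add(struct sbuf *sb, const char *data, size_t n)
{
    assert(data != NULL || n == 0);
    sbuf_grow(sb, n);
    if (n)
        memcpy(sb->buf + sb->len, data, n);
    sb->len += n;
    sb->buf[sb->len] = '\0';
}

/*
 * Hands the heap buffer to the caller (NULL if nothing was allocated),
 * optionally reports its length, and resets sb to the empty state.
 */
static char *sbuf_detach(struct sbuf *sb, size_t *sz)
{
    char *res = sb->alloc ? sb->buf : NULL;

    if (sz)
        *sz = sb->len;
    sbuf_init(sb);
    return res;
}
```

A caller that copied sb->buf instead of detaching could end up freeing
the shared buffer, which is why detaching is the only safe way to let
the pointer escape.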

Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-29 02:13:33 -07:00
Pierre Habouzit
182af8343c Use xmemdupz() in many places.
Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-18 17:42:17 -07:00
Pierre Habouzit
ba3ed09728 Now that cache.h needs strbuf.h, remove useless includes.
Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-16 17:30:03 -07:00
Pierre Habouzit
5ecd293d14 Rewrite convert_to_{git,working_tree} to use strbuf's.
* Now, those functions take an "out" strbuf argument, where they store their
  result if any. In that case, it also returns 1, else it returns 0.
* those functions support "in place" editing, in the sense that it's OK to
  call them this way:
    convert_to_git(path, sb->buf, sb->len, sb);
  When doable, conversions are done in place for real, else the strbuf
  content is just replaced with the new one, transparently for the caller.

If you want to create a new filter working this way, being the accumulation
of filter1, filter2, ... filtern, then your meta_filter would be:

    int meta_filter(..., const char *src, size_t len, struct strbuf *sb)
    {
        int ret = 0;
        ret |= filter1(...., src, len, sb);
        if (ret) {
            src = sb->buf;
            len = sb->len;
        }
        ret |= filter2(...., src, len, sb);
        if (ret) {
            src = sb->buf;
            len = sb->len;
        }
        ....
        return ret | filtern(..., src, len, sb);
    }

That's why the subfilters called by the convert_to_* functions were also
rewritten to work this way.

Signed-off-by: Pierre Habouzit <madcoder@debian.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-16 17:30:03 -07:00
René Scharfe
89b4256cfb Remove unused function convert_sha1_file()
convert_sha1_file() became unused by the previous patch -- remove it.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-09-03 16:46:23 -07:00
Andy Parkins
c23290d528 Fix mishandling of $Id$ expanded in the repository copy in convert.c
If the repository contained an expanded ident keyword (i.e. $Id:XXXX$),
then the wrong bytes were discarded, and the Id keyword was not
expanded.  The fault was in convert.c:ident_to_worktree().

Previously, when a "$Id:" was found in the repository version,
ident_to_worktree() would search for the next "$" after this, and
discarded everything it found until then.  That was done with the loop:

    do {
        ch = *cp++;
        if (ch == '$')
            break;
        rem--;
    } while (rem);

The above loop left cp pointing one character _after_ the final "$"
(because of ch = *cp++).  This was different from the non-expanded case,
where cp is left pointing at the "$", and was different from the comment
which stated "discard up to but not including the closing $".  This
patch fixes that by making the loop:

    do {
        ch = *cp;
        if (ch == '$')
            break;
        cp++;
        rem--;
    } while (rem);

That is, cp is tested _then_ incremented.

This loop exits if it finds a "$" or if it runs out of bytes in the
source.  After this loop, if there was no closing "$" the expansion is
skipped, and the outer loop is allowed to continue leaving this
non-keyword as it was.  However, when the "$" is found, size is
corrected, before running the expansion:

    size -= (cp - src);

This is wrong; size is going to be corrected anyway after the expansion,
so there is no need to do it here.  This patch removes that redundant
correction.

To help find this bug, I heavily commented the routine; those comments
are included here as a bonus.

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-26 01:12:43 -07:00
Andy Parkins
760f0c62ef Fix crlf attribute handling to match documentation
gitattributes.txt says, of the crlf attribute:

 Set::
    Setting the `crlf` attribute on a path is meant to mark
    the path as a "text" file.  'core.autocrlf' conversion
    takes place without guessing the content type by
    inspection.

That is to say that the crlf attribute does not force the file to have
CRLF line endings, instead it removes the autocrlf guesswork and forces
the file to be treated as text.  Then, whatever line ending is defined
by the autocrlf setting is applied.

However, that is not what convert.c was doing.  The conversion to CRLF
was being skipped in crlf_to_worktree() when the following condition was
true:

 action == CRLF_GUESS && auto_crlf <= 0

That is to say, conversion took place when not in guess mode (guess mode
being when the crlf attribute is not specified) or when core.autocrlf was
set to true.  This was wrong.  It meant
that the crlf attribute being on for a given file _forced_ CRLF
conversion, when actually it should force the file to be treated as
text, and converted accordingly.  The real test should simply be

 auto_crlf <= 0

That is to say, if core.autocrlf is false (or input), conversion from
LF to CRLF is never done.  When core.autocrlf is true, conversion from
LF to CRLF is done only when in CRLF_GUESS (and the guess is "text"), or
CRLF_TEXT mode.

Similarly for crlf_to_worktree(), if core.autocrlf is false, no conversion
should _ever_ take place.  In reality it was only not taking place if
core.autocrlf was false _and_ the crlf attribute was unspecified.
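
The intended decision for the checkout direction can be sketched as a
small predicate.  This is an illustration only: CRLF_GUESS and
CRLF_TEXT are named above, while CRLF_BINARY, the function name
wants_crlf_on_checkout() and the looks_like_text parameter are
assumptions made for the sketch.

```c
#include <assert.h>

/* CRLF_BINARY is an assumed name for "crlf attribute explicitly off". */
enum crlf_action { CRLF_GUESS, CRLF_BINARY, CRLF_TEXT };

/*
 * auto_crlf follows the convention used above: a value <= 0 means
 * core.autocrlf is false or input, a positive value means true.
 * Returns 1 if LF should be converted to CRLF on checkout.
 */
static int wants_crlf_on_checkout(enum crlf_action action, int auto_crlf,
                                  int looks_like_text)
{
    assert(action == CRLF_GUESS || action == CRLF_BINARY ||
           action == CRLF_TEXT);

    /* The configuration decides first; the attribute cannot force CRLF. */
    if (auto_crlf <= 0)
        return 0;

    switch (action) {
    case CRLF_BINARY:
        return 0;
    case CRLF_GUESS:
        return looks_like_text != 0;
    default:
        return 1; /* CRLF_TEXT: treated as text without guessing */
    }
}
```

The point of the fix is visible in the first test: with core.autocrlf
off, even a path marked as text is left alone.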

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-18 17:02:47 -07:00
René Scharfe
5e6cfc80e2 git-archive: convert archive entries like checkouts do
As noted by Johan Herland, git-archive is a kind of checkout and needs
to apply any checkout filters that might be configured.

This patch adds the convenience function convert_sha1_file which returns
a buffer containing the object's contents, after converting, if necessary
(i.e. it's a combination of read_sha1_file and convert_to_working_tree).
Direct calls to read_sha1_file in git-archive are then replaced by calls
to convert_sha1_file.

Since convert_sha1_file expects its path argument to be NUL-terminated --
a convention it inherits from convert_to_working_tree -- the patch also
changes the path handling in archive-tar.c to always NUL-terminate the
string.  It used to solely rely on the len field of struct strbuf before.

archive-zip.c already NUL-terminates the path and thus needs no such
change.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-18 16:36:45 -07:00
Andy Parkins
af9b54bb2c Use $Id$ as the ident attribute keyword rather than $ident$ to be consistent with other VCSs
$Id$ is present already in SVN and CVS; it would mean that people
converting their existing repositories won't have to make any changes to
the source files should they want to make use of the ident attribute.

Given that it's a feature that's meant to calm those very people, it
seems obtuse to make them edit every file just to make use of it.

I think that bzr uses $Id$; Mercurial has example hooks for $Id$;
monotone has $Id$ on its wishlist.  I can't think of a good reason not
to stick with the de-facto standard and call ours $Id$ instead of
$ident$.

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-14 19:03:32 -07:00
Junio C Hamano
aa4ed402c9 Add 'filter' attribute and external filter driver definition.
The interface is similar to the custom low-level merge drivers.

First you configure your filter driver by defining 'filter.<name>.*'
variables in the configuration.

	filter.<name>.clean	filter command to run upon checkin
	filter.<name>.smudge	filter command to run upon checkout

Then you assign filter attribute to each path, whose name
matches the custom filter driver's name.

Example:

	(in .gitattributes)
	*.c	filter=indent

	(in config)
	[filter "indent"]
		clean = indent
		smudge = cat

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-24 22:38:51 -07:00
Junio C Hamano
3fed15f568 Add 'ident' conversion.
The 'ident' attribute set to path squashes "$ident:<any bytes
except dollar sign>$" to "$ident$" upon checkin, and expands it
to "$ident: <blob SHA-1> $" upon checkout.
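
The checkin half of this conversion can be sketched as follows.  This
is an illustration under assumptions, not the routine in convert.c: the
name squash_ident() and the separate destination buffer are choices
made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KW_OPEN  "$ident:"
#define KW_SHORT "$ident$"

/*
 * Copies src[0..len) to dst, replacing every "$ident:<bytes>$" with
 * "$ident$".  dst must have room for len + 1 bytes; the result is
 * NUL-terminated and its length is returned.
 */
static size_t squash_ident(const char *src, size_t len, char *dst)
{
    const size_t open_len = strlen(KW_OPEN);
    const size_t short_len = strlen(KW_SHORT);
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (in < len) {
        if (len - in >= open_len && !memcmp(src + in, KW_OPEN, open_len)) {
            const char *body = src + in + open_len;
            const char *end = memchr(body, '$', len - in - open_len);

            if (end) {
                /* Drop everything up to and including the closing '$'. */
                memcpy(dst + out, KW_SHORT, short_len);
                out += short_len;
                in = (size_t)(end - src) + 1;
                continue;
            }
            /* No closing '$': leave the text as it is. */
        }
        dst[out++] = src[in++];
    }
    dst[out] = '\0';
    return out;
}
```

The output is never longer than the input, so a destination of
len + 1 bytes is always enough.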

As we have two conversions that affect checkin/checkout paths,
clarify how they interact with each other.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-24 22:38:51 -07:00
Alex Riesen
67e22ed58f Fix a typo in crlf conversion code
Also, noticed by valgrind: the code caused a read out-of-bounds.
Some comments updated as well (they still reflected old calling
conventions).

Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-22 10:44:38 -07:00
Junio C Hamano
6073ee8571 convert.c: restructure the attribute checking part.
This separates the checkattr() call and interpretation of the
returned value specific to the 'crlf' attribute into separate
routines, so that we can run a single call to checkattr() to
check for more than one attribute, and then interpret what
the returned settings mean separately.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-21 11:55:23 -07:00
Alex Riesen
ac78e54804 Simplify calling of CR/LF conversion routines
Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-20 23:24:34 -07:00
Junio C Hamano
163b959194 Update 'crlf' attribute semantics.
This updates the semantics of 'crlf' so that .gitattributes file
can say "this is text, even though it may look funny".

Setting the `crlf` attribute on a path is meant to mark the path
as a "text" file.  'core.autocrlf' conversion takes place
without guessing the content type by inspection.

Unsetting the `crlf` attribute on a path is meant to mark the
path as a "binary" file.  The path never goes through line
endings conversion upon checkin/checkout.

Unspecified `crlf` attribute tells git to apply the
`core.autocrlf` conversion when the file content looks like
text.

Setting the `crlf` attribute to string value "input" is similar
to setting the attribute to `true`, but also forces git to act
as if `core.autocrlf` is set to `input` for the path.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-19 22:37:44 -07:00
Junio C Hamano
a5e92abde6 Fix funny types used in attribute value representation
It was bothering me a lot that I abused small integer values
cast to (void *) to represent non string values in
gitattributes.  This corrects it by making the type of attribute
values (const char *), and using the address of a few statically
allocated character buffers to denote true/false.  Unset attributes
are represented as having NULLs as their values.

Added in-header documentation to explain how git_checkattr()
routine should be called.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-18 16:17:13 -07:00
Junio C Hamano
515106fa13 Allow more than true/false to attributes.
This allows you to define three values (and possibly more) to
each attribute: true, false, and unset.

Typically the handlers that notice and act on attribute values
treat "unset" attribute to mean "do your default thing"
(e.g. crlf that is unset would trigger "guess from contents"),
so being able to override a setting to an unset state is
actually useful.

 - If you want to set the attribute value to true, have an entry
   in .gitattributes file that mentions the attribute name; e.g.

	*.o	binary

 - If you want to set the attribute value explicitly to false,
   use '-'; e.g.

	*.a	-diff

 - If you want to make the attribute value _unset_, perhaps to
   override an earlier entry, use '!'; e.g.

	*.a	-diff
	c.i.a	!diff

This also allows string values to attributes, with the natural
syntax:

	attrname=attrvalue

but you cannot use it, as nobody takes notice and acts on
it yet.
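
How a single attribute token maps to one of these states can be
sketched as below.  This is an illustration only, not the parser in
attr.c: the attr_state and attr_setting types, the fixed 64-byte
fields and the function name parse_attr_token() are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum attr_state { ATTR_TRUE, ATTR_FALSE, ATTR_UNSET, ATTR_STRING };

struct attr_setting {
    enum attr_state state;
    char name[64];
    char value[64]; /* only meaningful for ATTR_STRING */
};

/*
 * Parses one token such as "binary", "-diff", "!diff" or
 * "attrname=attrvalue".  Returns 0 on success, -1 if the name is
 * empty or a field does not fit.
 */
static int parse_attr_token(const char *tok, struct attr_setting *out)
{
    const char *eq;
    size_t name_len;

    assert(tok != NULL && out != NULL);
    out->value[0] = '\0';

    if (*tok == '-' || *tok == '!') {
        out->state = (*tok == '-') ? ATTR_FALSE : ATTR_UNSET;
        tok++;
        name_len = strlen(tok);
    } else if ((eq = strchr(tok, '=')) != NULL) {
        out->state = ATTR_STRING;
        name_len = (size_t)(eq - tok);
        if (strlen(eq + 1) >= sizeof(out->value))
            return -1;
        strcpy(out->value, eq + 1);
    } else {
        out->state = ATTR_TRUE;
        name_len = strlen(tok);
    }

    if (name_len == 0 || name_len >= sizeof(out->name))
        return -1;
    memcpy(out->name, tok, name_len);
    out->name[name_len] = '\0';
    return 0;
}
```

A later entry for the same path would simply overwrite the state of an
earlier one, which is what makes the '!' form useful.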

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-17 01:04:59 -07:00
Junio C Hamano
201ac8efc7 Fix 'crlf' attribute semantics.
Earlier we said 'crlf lets the path go through core.autocrlf
process while !crlf disables it altogether'.  This fixes the
semantics to:

 - Lack of 'crlf' attribute makes core.autocrlf apply
   (i.e. we guess based on the contents and if platform
   expresses its desire to have CRLF line endings via
   core.autocrlf, we do so).

 - Setting 'crlf' attribute to true forces CRLF line endings in
   working tree files, even if blob does not look like text
   (e.g. contains NUL or other bytes we consider binary).

 - Setting 'crlf' attribute to false disables conversion.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-15 13:35:45 -07:00
Junio C Hamano
35ebfd6a0c Define 'crlf' attribute.
This defines the semantics of 'crlf' attribute as an example.
When a path has this attribute unset (i.e. '!crlf'), autocrlf
line-end conversion is not applied.

Eventually we would want to let users build a pipeline of
processing to munge blob data to filesystem format (and in the
other direction) based on combination of attributes, and at that
point the mechanism in convert_to_{git,working_tree}() that
looks at 'crlf' attribute needs to be enhanced.  Perhaps the
existing 'crlf' would become the first step in the input chain,
and the last step in the output chain.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-14 08:57:06 -07:00
Linus Torvalds
d7f4633405 Make AutoCRLF ternary variable.
This allows you to do:

	[core]
		AutoCRLF = input

and it should do only the CRLF->LF translation (ie it simplifies CRLF only
when reading working tree files, but when checking out files, it leaves
the LF alone, and doesn't turn it into a CRLF).
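
The one-way translation can be sketched as an in-place pass over the
buffer.  This is an illustration only, not the code in convert.c; the
function name crlf_to_lf() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rewrites buf in place so that every "\r\n" becomes "\n" and returns
 * the new length.  A "\r" that is not followed by "\n" is kept.
 */
static size_t crlf_to_lf(char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL || len == 0);
    for (in = 0; in < len; in++) {
        /* Skip the CR of a CRLF pair; the LF is copied next round. */
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;
        buf[out++] = buf[in];
    }
    return out;
}
```

Because the output never grows, no second buffer is needed.  With
"input", only this direction runs; nothing is added back on checkout.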

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-14 11:19:28 -08:00
Linus Torvalds
6c510bee20 Lazy man's auto-CRLF
It currently does NOT know about file attributes, so it does its
conversion purely based on content. Maybe that is more in the "git
philosophy" anyway, since content is king, but I think we should try to do
the file attributes to turn it off on demand.

Anyway, BY DEFAULT it is off regardless, because it requires a

	[core]
		AutoCRLF = true

in your config file to be enabled. We could make that the default for
Windows, of course, the same way we do some other things (filemode etc).

But you can actually enable it on UNIX, and it will cause:

 - "git update-index" will write blobs without CRLF
 - "git diff" will diff working tree files without CRLF
 - "git checkout" will write files to the working tree _with_ CRLF

and things work fine.

Funnily, it actually shows an odd file in git itself:

	git clone -n git test-crlf
	cd test-crlf
	git config core.autocrlf true
	git checkout
	git diff

shows a diff for "Documentation/docbook-xsl.css". Why? Because we have
actually checked in that file *with* CRLF! So when "core.autocrlf" is
true, we'll always generate a *different* hash for it in the index,
because the index hash will be for the content _without_ CRLF.

Is this complete? I dunno. It seems to work for me. It doesn't use the
filename at all right now, and that's probably a deficiency (we could
certainly make the "is_binary()" heuristics also take standard filename
heuristics into account).

I don't pass in the filename at all for the "index_fd()" case
(git-update-index), so that would need to be passed around, but this
actually works fine.

NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours
truly. I will not guarantee that they work at all reasonably. Caveat
emptor. But it _is_ simple, and it _is_ safe, since it's all off by
default.

The patch is pretty simple - the biggest part is the new "convert.c" file,
but even that is really just basic stuff that anybody can write in
"Teaching C 101" as a final project for their first class in programming.
Not to say that it's bug-free, of course - but at least we're not talking
about rocket surgery here.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-14 11:19:22 -08:00