As fast-import is quite strict about its input and die()'s anytime
something goes wrong it can be difficult for a frontend developer
to troubleshoot why fast-import rejected their input, or to even
determine what input command it rejected.
This change introduces a custom handler for Git's die() routine.
When we receive a die() for any reason (fast-import or a lower level
core Git routine we called) the error is first dumped onto stderr
and then a more extensive crash report file is prepared in GIT_DIR.
Finally we exit the process with status 128, just like the stock
builtin die handler.
An internal flag is set to prevent any further die()'s that may be
invoked during the crash report generator from causing us to enter
into an infinite loop. We shouldn't die() from our crash report
handler, but just in case someone makes a future code change we are
prepared to gaurd against small mistakes turning into huge problems
for the end-user.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The existing checkpoint command is very useful to force fast-import
to dump the branches out to disk so that standard Git tools can
access them and the objects they refer to. However there was not a
way to know when fast-import had finished executing the checkpoint
and it was safe to read those refs.
The progress command can be used to make fast-import output any
message of the frontend's choosing to standard out. The frontend
can scan for these messages using select() or poll() to monitor a
pipe connected to the standard output of fast-import.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
For the same reasons as the prior change we want to allow frontends
to omit the trailing LF that usually delimits commands. In some
cases these just make the input stream more verbose looking than
it needs to be, and its just simpler for the frontend developer to
get started if our parser is slightly more lenient about where an
LF is required and where it isn't.
To make this optional LF feature work we now have to buffer up to one
line of input in command_buf. This buffering can happen if we look
at the current input command but don't recognize it at this point
in the code. In such a case we need to "unget" the entire line,
but we cannot depend upon the stdio library to let us do ungetc()
for that many characters at once.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
A few fast-import frontend developers have found it odd that we
require the LF following a `data` command, especially in the exact
byte count format. Technically we don't need this LF to parse
the stream properly, but having it here does make the stream more
readable to humans. We can easily make the LF optional by peeking
at the next byte available from the stream and pushing it back into
the buffer if its not LF.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Several frontend developers have asked that some form of stream
comments be permitted within a fast-import data stream. This way
they can include information from their own frontend program about
where specific data was taken from in the source system, or about
a decision that their frontend may have made while creating the
fast-import data stream.
This change introduces comments in the Bourne-shell/Tcl/Perl style.
Lines starting with '#' are ignored, up to and including the LF.
Unlike the above mentioned three languages however we do not look for
and ignore leading whitespace. This just simplifies the definition
of the comment format and the code that parses them.
To make comments work we had to stop using read_next_command() within
cmd_data() and directly invoke read_line() during the inline variant
of the function. This is necessary to retain any lines of the
input data that might otherwise look like a comment to fast-import.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Instead of growing our buffer by hand during the inline variant of
cmd_data() we can save a few lines of code and just use the nifty
new ALLOC_GROW macro already available to us.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Michael Haggerty <mhagger@alum.mit.edu> noticed while debugging a
Git backend for cvs2svn that fast-import was barfing when he tried
to use "TAG_FIXUP" as a branch name for temporary work needed to
cleanup the tree prior to creating an annotated tag object.
The reason we were rejecting the branch name was check_ref_format()
returns -2 when there are less than 2 '/' characters in the input
name. TAG_FIXUP has 0 '/' characters, but is technically just as
valid of a ref as HEAD and MERGE_HEAD, so we really should permit it
(and any other similar looking name) during import.
New test cases have been added to make sure we still detect very
wrong branch names (e.g. containing [ or starting with .) and yet
still permit reasonable names (e.g. TAG_FIXUP).
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Something probably assumed that HT indentation is 4 characters.
Signed-off-by: Alex Riesen <raa.lkml@gmail.com>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
xmkstemp() performs error checking and prints a standard error message when
an error occur.
Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some source material (e.g. Subversion dump files) perform directory
renames by telling us the directory was copied, then deleted in the
same revision. This makes it difficult for a frontend to convert
such data formats to a fast-import stream, as all the frontend has
on hand is "Copy a/ to b/; Delete a/" with no details about what
files are in a/, unless the frontend also kept track of all files.
The new 'C' subcommand within a commit allows the frontend to make a
recursive copy of one path to another path within the branch, without
needing to keep track of the individual file paths. The metadata
copy is performed in memory efficiently, but is implemented as a
copy-immediately operation, rather than copy-on-write.
With this new 'C' subcommand frontends could obviously implement an
'R' (rename) on their own as a combination of 'C' and 'D' (delete),
but since we have already offered up 'R' in the past and it is a
trivial thing to keep implemented I'm not going to deprecate it.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some source material (e.g. Subversion dump files) perform directory
renames without telling us exactly which files in that subdirectory
were moved. This makes it hard for a frontend to convert such data
formats to a fast-import stream, as all the frontend has on hand
is "Rename a/ to b/" with no details about what files are in a/,
unless the frontend also kept track of all files.
The new 'R' subcommand within a commit allows the frontend to
rename either a file or an entire subdirectory, without needing to
know the object's SHA-1 or the specific files contained within it.
The rename is performed as efficiently as possible internally,
making it cheaper than a 'D'/'M' pair for a file rename.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When e8438420bb allowed us to reload
the marks table on subsequent runs of fast-import we really broke
things, as we set pack_id to MAX_PACK_ID for any objects we imported
into the marks table. Creating a branch from that mark should fail
as we attempt to read the object through a non-existant packed_git
pointer. Instead we have to use the normal Git object system to
locate the older commit, as we ourselves do not have a reference
to the packed_git it resides in.
This bug only occurred because t9300 was not complete enough.
When we added the --import-marks feature we didn't actually test
its implementation enough to verify the function worked as intended.
I have corrected that, and included the changes as part of this fix.
Prior versions of fast-import fail the new test(s); this commit
allows them to pass.
Credit for this bug find goes to Simon Hausmann <simon@lst.de> as
he recently identified a similiar bug in the tree lazy-loading path.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
To resolve a corner case uncovered by Simon Hausmann I need to
reuse the logic for the SHA-1 expression version of the 'from '
command within the mark version of the 'from ' command. This change
doesn't alter any functionality, but is merely breaking the common
code out to a function that I can reuse.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Commit a5c1780a03 sets the pack_id of existing
objects to MAX_PACK_ID. When the same object is referenced later again it is
found in the local object hash. With such a pack_id fast-import should not try
to locate that object in the newly created pack(s).
Signed-off-by: Simon Hausmann <simon@lst.de>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Fix uninitialized last_object->no_free variable that is accessed in
store_object.
Signed-off-by: Simon Hausmann <simon@lst.de>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
git-checkout is also adapted to make use of this new option
instead of the handcrafted command sequence.
Signed-off-by: Sven Verdoolaege <skimo@kotnet.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Include a generalized fixup_pack_header_footer() in this new file.
Needed by git-repack --max-pack-size feature in a later patchset.
[sp: Moved close(pack_fd) to callers, to support index-pack, and
changed name to better indicate it is for packfiles.]
Signed-off-by: Dana L. How <danahow@gmail.com>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* maint:
http.c: Fix problem with repeated calls of http_init
Add missing reference to GIT_COMMITTER_DATE in git-commit-tree documentation
Fix import-tars fix.
Update .mailmap with "Michael"
Do not barf on too long action description
Catch empty pathnames in trees during fsck
Don't allow empty pathnames in fast-import
import-tars: be nice to wrong directory modes
git-svn: Added 'find-rev' command
git shortlog documentation: add long options and fix a typo
riddochc on #git noticed corruption caused by import-tars. This
was fixed in the prior commit by Dscho, but fast-import was wrong
to have allowed a tree to be created with an empty string as the
filename. No operating system allows this, and Git itself doesn't
accept this into the index.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some users of fast-import have been trying to use it to rewrite
commits and trees, an activity where the all of the relevant blobs
are already available from the existing packfiles. In such a case
we don't want to repack a blob, even if the frontend application
has supplied us the raw data rather than a mark or a SHA-1 name.
I'm intentionally only checking the packfiles that existed when
fast-import started and am always ignoring all loose object files.
We ignore loose objects because fast-import tends to operate on a
very large number of objects in a very short timespan, and it is
usually creating new objects, not reusing existing ones. In such
a situtation the majority of the objects will not be found in the
existing packfiles, nor will they be loose object files. If the
frontend application really wants us to look at loose object files,
then they can just repack the repository before running fast-import.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This fixes a problem reported by Randal Schwartz:
>I finally tracked down all the (albeit inconsequential) errors I was getting
>on both OpenBSD and OSX. It's the warn() function in usage.c. There's
>warn(3) in BSD-style distros. It'd take a "great rename" to change it, but if
>someone with better C skills than I have could do that, my linker and I would
>appreciate it.
It was annoying to me, too, when I was doing some mergetool testing on
Mac OS X, so here's a fix.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Randal L. Schwartz" <merlyn@stonehenge.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
When some operations are interrupted (or "die()'d" or crashed) then the
partial object/pack/index file may remain around. Make it more obvious
in their name that those files are temporary stuff and can be cleaned up
if no operation is in progress.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Jeff King pointed out that these casts are quite unnecessary, as
the compiler should be doing them anyway, and may cause problems
in the future if the size of the argument for to_atom were to ever
be increased.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When building up a tree for a commit, fast-import
dynamically allocates memory for the tree entries. When more
space is needed, the allocated memory is increased by a
constant amount. For very large trees, this means
re-allocating and memcpy()ing the memory O(n) times.
To compound this problem, releasing the previous tree
resource does not free the memory; it is kept in a pool
for future trees. This means that each of the O(n)
allocations will consume increasing amounts of memory,
giving O(n^2) memory consumption.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* 'master' of git://repo.or.cz/git/fastimport:
Allow fast-import frontends to reload the marks table
Use atomic updates to the fast-import mark file
Preallocate memory earlier in fast-import
I'm giving fast-import a lesson on how to reload the marks table
using the same format it outputs with --export-marks. This way
a frontend can reload the marks table from a prior import, making
incremental imports less painful.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When we allow fast-import frontends to reload a mark file from a
prior session we want to let them use the same file as they exported
the marks to. This makes it very simple for the frontend to save
state across incremental imports.
But we don't want to lose the old marks table if anything goes wrong
while writing our current marks table. So instead of truncating and
overwriting the path specified to --export-marks we use the standard
lockfile code to write the current marks out to a temporary file,
then rename it over the old marks table.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
I'm about to teach fast-import how to reload the marks file created
by a prior session. The general approach that I want to use is to
immediately parse the marks file when the specific argument is found
in argv, thereby allowing the caller to supply multiple marks files,
as the mark space can be sparsely populated.
To make that work out we need to allocate our object tables before
we parse the command line options. Since none of these tables
depend on the command line options, we can easily relocate them.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Always use an off_t value in pack-objects anytime we are dealing
with an offset to some data within a packfile.
Also fixed a minor uintmax_t that was incorrectly defined before.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Not all platforms have declared 'unsigned long' to be a 64 bit value,
but we want to support a 64 bit packfile (or close enough anyway)
in the near future as some projects are getting large enough that
their packed size exceeds 4 GiB.
By using off_t, the POSIX type that is declared to mean an offset
within a file, we support whatever maximum file size the underlying
operating system will handle. For most modern systems this is up
around 2^60 or higher.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
We shouldn't attempt to assign constant strings into char*, as the
string is not writable at runtime. Likewise we should always be
treating unsigned values as unsigned values, not as signed values.
Most of these are very straightforward. The only exception is the
(unnecessary) xstrdup/free in builtin-branch.c for the detached
head case. Since this is a user-level interactive type program
and that particular code path is executed no more than once, I feel
that the extra xstrdup call is well worth the easy elimination of
this warning.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
* maint:
fast-import: Fail if a non-existant commit is used for merge
fast-import: Avoid infinite loop after reset
[sp: Minor evil merge to deal with type_names array moving
to be private in 'master'.]
Johannes Sixt noticed during one of his own imports that fast-import
did not fail if a non-existant commit is referenced by SHA-1 value
as an argument to the 'merge' command. This allowed the user to
unknowingly create commits that would fail in fsck, as the commit
contents would not be completely reachable.
A side effect of this bug was that a frontend process could mark
any SHA-1 object (blob, tree, tag) as a parent of a merge commit.
This should also fail in fsck, as the commit is not a valid commit.
We now use the same rule as the 'from' command. If a commit is
referenced in the 'merge' command by hex formatted SHA-1 then the
SHA-1 must be a commit or a tag that can be peeled back to a commit,
the commit must already exist, and must be readable by the core Git
infrastructure code. This requirement means that the commit must
have existed prior to fast-import starting, or the commit must have
been flushed out by a prior 'checkpoint' command.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Johannes Sixt noticed that a 'reset' command applied to a branch that
is already active in the branch LRU cache can cause fast-import to
relink the same branch into the LRU cache twice. This will cause
the LRU cache to contain a cycle, making unload_one_branch run in an
infinite loop as it tries to select the oldest branch for eviction.
I have trivially fixed the problem by adding an active bit to
each branch object; this bit indicates if the branch is already
in the LRU and allows us to avoid trying to add it a second time.
Converting the pack_id field into a bitfield makes this change take
up no additional memory.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We currently have two parallel notation for dealing with object types
in the code: a string and a numerical value. One of them is obviously
redundent, and the most used one requires more stack space and a bunch
of strcmp() all over the place.
This is an initial step for the removal of the version using a char array
found in object reading code paths. The patch is unfortunately large but
there is no sane way to split it in smaller parts without breaking the
system.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Sometime typename() is used, sometimes type_names[] is accessed directly.
Let's enforce typename() all the time which allows for validating the
type.
Also let's add a function to go from a name to a type and use it instead
of manual memcpy() when appropriate.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Previous step converted use of strncmp() with literal string
mechanically even when the result is only used as a boolean:
if (!strncmp("foo", arg, 3)) ==> if (!(-prefixcmp(arg, "foo")))
This step manually cleans them up to read:
if (!prefixcmp(arg, "foo"))
Signed-off-by: Junio C Hamano <junkio@cox.net>
This mechanically converts strncmp() to use prefixcmp(), but only when
the parameters match specific patterns, so that they can be verified
easily. Leftover from this will be fixed in a separate step, including
idiotic conversions like
if (!strncmp("foo", arg, 3))
=>
if (!(-prefixcmp(arg, "foo")))
This was done by using this script in px.perl
#!/usr/bin/perl -i.bak -p
if (/strncmp\(([^,]+), "([^\\"]*)", (\d+)\)/ && (length($2) == $3)) {
s|strncmp\(([^,]+), "([^\\"]*)", (\d+)\)|prefixcmp($1, "$2")|;
}
if (/strncmp\("([^\\"]*)", ([^,]+), (\d+)\)/ && (length($1) == $3)) {
s|strncmp\("([^\\"]*)", ([^,]+), (\d+)\)|(-prefixcmp($2, "$1"))|;
}
and running:
$ git grep -l strncmp -- '*.c' | xargs perl px.perl
Signed-off-by: Junio C Hamano <junkio@cox.net>
Thanks to Simon 'corecode' Schubert <corecode@fs.ei.tum.de> for
the clean-up. Defining the C99 standard PRIuMAX when necessary
replaces UM_FMT and the awkward UM10_FMT. There are no direct
C99 translations for other uses of NO_C99_FORMAT in git, alas.
Signed-off-by: Jason Riedy <ejr@cs.berkeley.edu>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Define UM_FMT and UM10_FMT and use in place of %ju and %10ju,
respectively. Both format as unsigned long long, so this
assumes the compiler supports long long.
Signed-off-by: Jason Riedy <jason@acm.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
It was suggested on the mailing list that being able to use `from`
in any commit to reset the current branch is useful in some types of
importers, such as a darcs importer.
We originally did not permit resetting an existing branch with a
new `from` command during a `commit` command, but this restriction
was only to help debug the hacked up cvs2svn that Jon Smirl was
developing in parallel with git-fast-import. It is probably more
of a problem to disallow it than to allow it. So now we permit a
`from` during any `commit`.
While making the changes required to permit multiple `from`
commands on the same branch, I discovered we no longer needed the
last_commit field to be set to 0 during a reset, so that was removed.
(Reset was originally setting the field to 0 to signal cmd_from()
that it was OK to execute on the branch.)
While poking around in this section of fast-import I also realized
the `reset` command was not working as intended if the corresponding
`from` command was omitted (as allowed by the BNF grammar and the
code). If `from` was omitted we cleared out the tree but we left
the tree SHA-1 and parent commit SHA-1 intact. This is not what
the user intended in this case. Instead they would be trying to
reset the branch to have no parent and to have no tree, making the
branch look new-born during the next commit. We now clear these
SHA-1 values during `reset`, ensuring the branch looks new-born if
`from` does not get supplied.
New test cases for these were also added.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Most users don't need the pack boundary information that fast-import
was printing to standard output, especially if they were calling
it with --quiet.
Those users who do want this information probably want it captured
so they can go back and use it to repack the imported repository.
So dumping the boundary commits to a log file makes more sense then
printing them to standard output.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Not on all platforms are size_t and unsigned long equivalent.
Since I do not know how portable %z is, I play safe, and just
cast the respective variables to unsigned long.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
Apparently fast-import used to die a horrible death if we
were unable to open the marks file for output. This is
slightly less than ideal, especially now that we dump
the marks as part of the `checkpoint` command.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the frontend asks us to checkpoint (via the explicit checkpoint
command) its probably because they are afraid the current import
will crash/fail/whatever and want to make sure they can pickup from
the last checkpoint. To do that sort of recovery, we will need the
current tip of every branch and tag available at the next startup.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Often users will be running fast-import from within a larger frontend
process, and this may be a frequent periodic tool such as a future
edition of `git-svn fetch`. We don't want to bombard users with our
large stats output if they won't be interested in it, so `--quiet`
is now an option to make gfi be more silent.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some frontends may not be able to (easily) keep track of which files
are included in the branch, and which aren't. Performing this
tracking can be tedious and error prone for the frontend to do,
especially if its foreign data source cannot supply the changed
path list on a per-commit basis.
fast-import now allows a frontend to request that a branch's tree
be wiped clean (reset to the empty tree) at the start of a commit,
allowing the frontend to feed in all paths which belong on the branch.
This is ideal for a tar-file importer frontend, for example, as
the frontend just needs to reformat the tar data stream into a gfi
data stream, which may be something a few Perl regexps can take
care of. :)
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If fast-import is being used to update an existing branch of
a repository, the user may not want to lose commits if another
process updates the same ref at the same time. For example, the
user might be using fast-import to make just one or two commits
against a live branch.
We now perform a fast-forward check during the ref updating process.
If updating a branch would cause commits in that branch to be lost,
we skip over it and display the new SHA1 to standard error.
This new default behavior can be overridden with `--force`, like
git-push and git-fetch.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Since some frontends may be working with source material where
the dates are only readily available as RFC 2822 strings, it is
more friendly if fast-import exposes Git's parse_date() function
to handle the conversion. This way the frontend doesn't need
to perform the parsing itself.
The new --date-format option to fast-import can be used by a
frontend to select which format it will supply date strings in.
The default is the standard `raw` Git format, which fast-import
has always supported. Format rfc2822 can be used to activate the
parse_date() function instead.
Because fast-import could also be useful for creating new, current
commits, the format `now` is also supported to generate the current
system timestamp. The implementation of `now` is a trivial call
to datestamp(), but is actually a whole whopping 3 lines so that
fast-import can verify the frontend really meant `now`.
As part of this change I have added validation of the `raw` date
format. Prior to this change fast-import would accept anything
in a `committer` command, even if it was seriously malformed.
Now fast-import requires the '> ' near the end of the string and
verifies the timestamp is formatted properly.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There is no need to check for a NULL pointer before invoking free(),
the runtime library automatically performs this check anyway and
does nothing if a NULL pointer is supplied.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Junio noticed that I was using a different style in fast-import
for returned pointers than the rest of Git. Before merging this
code into the main git.git tree I'd like to make it consistent,
as this style variation was not intentional.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Junio noticed these warnings/errors in fast-import when compiling
with `-Werror -ansi -pedantic`. A few changes are to reduce compiler
warnings, while one (in cmd_merge) is a bug fix.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The --branch-log option and its associated code hasn't been used in
several months, as its not really very useful for debugging fast-import
or a frontend. I don't plan on supporting it in this state long-term,
so I'm killing it now before it gets distributed to a wider audience.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The current implementation of shell-style quoted refnames and
SHA-1 expressions within fast-import contains a bad memory leak.
We leak the unquoted strings used by the `from` and `merge`
commands, maybe others. Its also just muddling up the docs.
Since Git refnames cannot contain LF, and that is our delimiter
for the end of the refname, and we accept any other character
as-is, there is no reason for these strings to support quoting,
except to be nice to frontends. But frontends shouldn't be
expecting to use funny refs in Git, and its just as simple to
never quote them as it is to always pass them through the same
quoting filter as pathnames. So frontends should never quote
refs, or ref expressions.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some structs are allocated rather frequently, but were using integer
types which were far larger than required to actually store their
full value range.
As packfiles are limited to 4 GiB we don't need more than 32 bits to
store the offset of an object within that packfile, an `unsigned long`
on a 64 bit system is likely a 64 bit unsigned value. Saving 4 bytes
per object on a 64 bit system can add up fast on any sizable import.
As atom strings are strictly single components in a path name these
are probably limited to just 255 bytes by the underlying OS. Going
to that short of a string is probably too restrictive, but certainly
`unsigned int` is far too large for their lengths. `unsigned short`
is a reasonable limit.
Modes within a tree really only need two bytes to store their whole
value; using `unsigned int` here is vast overkill. Saving 4 bytes
per file entry in an active branch can add up quickly on a project
with a large number of files.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This command isn't encouraged (as its slow) but it does exist and
is accepted, so it still should be covered in the BNF.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
git-fast-import requires use of inttypes.h, but the master branch has
added it to git-compat-util differently than git-fast-import originally
had used it. This merge back of master to the fast-import topic is to
get (and use) inttypes.h the way master is using it.
This is a partially evil merge to remove the call to setup_ident(),
as the master branch now contains a change which makes this implicit
and therefore removed the function declaration. (commit 01754769).
Conflicts:
git-compat-util.h
Its very annoying to need to specify the file content ahead of a
commit and use marks to connect the individual blobs to the commit's
file modification entry, especially if the frontend can't/won't
generate the blob SHA1s itself. Instead it would much easier to
use if we can accept the blob data at the same time as we receive
each file_change line.
Now fast-import accepts 'inline' instead of a mark idnum or blob
SHA1 within the 'M' type file_change command. If an inline is
detected the very next line must be a 'data n' command, supplying
the file data.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
During testing its nice to not have to feed the length of a data
chunk to the 'data' command of fast-import. Instead we would
prefer to be able to establish a data chunk much like shell's <<
operator and use a line delimiter to denote the end of the input.
So now if a data command is started as 'data <<EOF' we will look
for a terminator line containing only the string EOF on that line.
Once found, we stop the data command. Everything between the two
lines is used as the data value.
The 'data <<' syntax is slower than 'data n', as we don't know how
many bytes to expect and instead must grow our buffer on the fly.
It also has the problem that the frontend must use a string which
will not appear on a line by itself in the input, and the data
region will always end in an LF. For these reasons real import
frontends are encouraged to continue to use _only_ 'data n'.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The --objects command line option is rather unnecessary. Internally
we allocate objects in 5000 unit blocks, ensuring that any sort
of malloc overhead is ammortized over the individual objects to
almost nothing. Since most frontends don't know how many objects
they will need for a given import run (and its hard for them to
predict without just doing the run) we probably won't see anyone
using --objects. Further since there's really no major benefit
to using the option, most frontends won't even bother supplying
it even if they could estimate the number of objects. So I'm
removing it.
The --max-objects-per-pack option was probably a mistake to even
have added in the first place. The packfile format is limited
to 4 GiB today; given that objects need at least 3 bytes of data
(and probably need even more) there's no way we are going to exceed
the limit of 1<<32-1 objects before we reach the file size limit.
So I'm removing it (to slightly reduce the complexity of the code)
before anyone gets any wise ideas and tries to use it.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Currently the pack .idx file format uses 32-bit unsigned integers
for the fan-out table and the object offsets. We had previously
defined these as 'unsigned int', but not every system will define
that type to be a 32 bit value. To ensure maximum portability we
should always use 'uint32_t'.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Previously we were using 'unsigned int' to update the hdr_entries
field of the pack header after the file had been completed and
was being hashed. This may not be 32 bits on all platforms.
Instead we want to always uint32_t.
I'm actually cheating here by just using the pack_header like the
rest of Git and letting the struct definition declare the correct
type. Right now that field is still 'unsigned int' (wrong) but a
pending change submitted by Simon 'corecode' Schubert changes it
to uint32_t. After that change is merged in fast-import will do
the right thing all of the time.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Branches are only contained by a packfile if the branch actually
had its most recent commit in that packfile. So new branches are
set to MAX_PACK_ID to ensure they don't cause their commit to list
as part of the first packfile when it closes out if the commit was
actually in existance before fast-import started.
Also corrected the type of last_commit to be umaxint_t to prevent
overflow and wraparound on very large imports. Though that is
highly unlikely to occur as we're talking 4 billion commits, which
no real project has right now.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Apparently the git convention is to declare any function which
takes no arguments as taking void. I did not do this during the
early fast-import development, but should have.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The length of an atom string cannot be negative. So make it
explicit and declare it as an unsigned value.
The shift width in a mark table node also cannot be negative.
I'm also moving it to after the pointer arrays to prevent any
possible alignment problems on a 64 bit system.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Now that fast-import uses uintmax_t (the largest available unsigned
integer type) for marks we don't want to say its an unsigned 32
bit integer in ASCII base 10 notation. It could be much larger,
especially on 64 bit systems, and especially if a frontend uses
a very large number of marks (1 per file revision on a very, very
large import).
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
To help callers repack very large repositories into a series of
packfiles fast-import now outputs the last commits/tags it wrote to
a packfile when it prints out the packfile name. This information
can be feed to pack-objects --revs to repack. For the first pack
of an initial import this is pretty easy (just feed those SHA1s on
stdin) but for subsequent packs you want to feed the subsequent
pack's final SHA1s but also all prior pack's SHA1s prefixed with
the negation operator. This way the prior pack's data does not
get included into the subsequent pack.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Since object_count is limited to 'unsigned long' (really an
unsigned 32 bit integer value) by the pack file format we may as
well use exactly that type here in fast-import for that counter.
An earlier change by me incorrectly made it uintmax_t.
But since object_count is a counter for the current packfile only,
we don't want to output its value at the end. Instead we should
sum up the individual type counters and report that total, as that
will cover all of the packfiles.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Apparently amd64 has defined 'unsigned long' to be a 64 bit value,
which means -1 was way over the 4 GiB packfile limit. Whoops.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Much like the pack_sha1 the pack_fd is an unnecessary global
variable, we already have the fd stored in our struct packed_git
*pack_data so that the core library functions in sha1_file.c are
able to lookup and decompress object data that we have previously
written. Keeping an extra copy of this value in our own variable
is just a hold-over from earlier versions of fast-import and is
now completely unnecessary.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Because we are renaming the packfile into its file destination we
need to be sure its not open when the rename is called, otherwise
some operating systems (e.g. Windows) may prevent the rename from
occurring.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Because fast-import automatically updates all references (heads
and tags) at the end of its run the repository is corrupt unless
the objects are available in the .git/objects/pack directory prior
to the refs being modified. The easiest way to ensure that is true
is to move the packfile and its associated index directly into the
.git/objects/pack directory as soon as we have finished output to it.
But the only safe way to do this is to create the a temporary .keep
file for that pack, so we use the same tricks that index-pack uses
when its being invoked by receive-pack.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Rather than maintaing our own packfile level sha1 variable we
can make use of the one already available in struct packed_git.
Its meant for the SHA1 of the index but it can also hold the
SHA1 of the packfile itself between final checksumming of the
packfile and creation of the index.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Prior to git having read_in_full() fast-import used its own private
function yread to perform the header reading task. No sense in
keeping that around now that read_in_full is a public, stable
function.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If a frontend wants to use a mark per file revision and per commit
and is doing a truly huge import (such as a 32 GiB SVN repository)
we may need more than 2**32 unique mark values, especially if the
frontend is unable (or unwilling) to recycle mark values. For mark
idnums we should use the largest unsigned integer type available,
hoping that will be at least 64 bits when we are compiled as a 64
bit executable. This way we may consume huge amounts of memory
storing our mark table, but we'll at least be able to process
the entire import without failing.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If we previously were using a delta but we needed to checkpoint the
current packfile and switch to a new packfile we need to throw away
the delta and compress the raw object by itself, as delta chains
cannot span non-thin packfiles. Unfortunately the output buffer
in this case needs to grow, as the size of the compressed object
may be quite a bit larger than the size of the compressed delta.
I've also avoided recompressing the object if we are checkpointing
and we didn't use a delta. In this case the output buffer is the
correct size and has already been populated with the right data,
we just need to close out the current packfile and open a new one.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Caller scripts may want to know what packfiles the fast-import
process just wrote out for them. This is now output to stdout,
one packfile name per line, after we checkpoint each packfile.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When the number of objects or number of bytes gets close to the limit
allowed by the packfile format (or configured on the command line by
our caller) we should automatically checkpoint the current packfile
and start a new one before writing the object out. This does however
require that we abandon the delta (if we had one) as its not valid
in a new packfile.
I also added the simple rule that if we got a delta back but the
delta itself is the same size as or larger than the uncompressed
object to ignore the delta and just store the object data. This
should avoid some really bad behavior caused by our current delta
strategy.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When we are generating multiple packfiles at once we only need
to scan the blocks of object_entry structs which contain objects
for the current packfile. Because the most recent blocks are at
the front of the linked list, and because all new objects going
into the current file are allocated from the front of that list,
we can stop scanning for objects as soon as we identify one which
doesn't belong to the current packfile.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the last packfile is going to be empty (has 0 objects) then it
shouldn't be kept after the import has terminated, as there is no
point to the packfile. So rather than hashing it and making the
index file, just delete the packfile.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
To help importers which are dealing with massive amounts of data
fast-import needs to be able to close the packfile it is currently
writing to and open a new packfile for any additional data that
will be received. A new 'checkpoint' command has been introduced
which can be used by the frontend import process to force this
to occur at any time. This may be useful to ensure a very long
running import doesn't lose any work due to unexpected failures.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
There is little reason to be keeping a global duplicate_count
value when we also keep it per object type. The global counter can
easily be computed at the end, once all processing has completed.
This saves us a couple of machine instructions in an unimportant
part of code. But it looks slightly better to me to not keep
two counters around.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Now that we are starting to see some really large projects (such
as KDE or a fork of FreeBSD) get imported into Git we're running
into the upper limit on packfile object count as well as overall
byte length. The KDE and FreeBSD projects are both likely to
require more than 4 GiB to store their current history, which means
we really need multiple packfiles to handle their content.
This is a fairly simple restructuring of the internal code to help
us support creating multiple packfiles from within fast-import.
We are now adding a 5 digit incrementing suffix to the end of the
basename supplied to us by the caller, permitting up to 99,999
packs to be generated in a single fast-import run.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Now that the sha1_file.c library routines use the sliding mmap
routines to perform efficient access to portions of a packfile
I can remove that code from fast-import.c and just invoke it.
One benefit is we now have reloading support for any packfile which
uses OBJ_OFS_DELTA. Another is we have significantly less code
to maintain.
This code reuse change *requires* that fast-import generate only
an OBJ_OFS_DELTA format packfile, as there is absolutely no index
available to perform OBJ_REF_DELTA lookup in while unpacking
an object. This is probably reasonable to require as the delta
offsets result in smaller packfiles and are faster to unpack,
as no index searching is required. Its also only a temporary
requirement as users could always repack without offsets before
making the import available to older versions of Git.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
I'm bringing master in early so that the OBJ_OFS_DELTA implementation
is available as part of the topic. This way git-fast-import can
learn about this new slightly smaller and faster packfile format,
and can generate them directly rather than needing to have them be
repacked with git-pack-objects.
Due to the API changes in master during the period of development
of git-fast-import, a few minor tweaks to fast-import.c are needed
to produce a working merge. I've done them here as part of the
merge to ensure bisection always works.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some importers may want to create a branch long before they actually
commit to it, or in some cases they may never commit to the branch
but they still need the ref to be created in the repository after
the import is complete.
This extends the 'reset ' command to automatically create a new
branch if the supplied reference isn't already known as a branch.
While I'm at it I also modified the syntax of the reset command
to terminate with an empty line, like commit and tag operate.
This just makes the command set more consistent.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some importers are able to determine when branch merges occurred
within their source data. In these cases they will want to supply
the correct commits to fast-import so that a proper merge commit
will exist in Git. This is now supported by supplying a 'merge '
command after the commit message and optional from command.
A merge is not actually performed by fast-import, its assumed that
the frontend performed any sort of merging activity already and
that fast-import should simply be storing its result.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Apparently we did not copy the blob SHA1 into the stack variable
'sha1' when a mark is used to refer to a prior blob. This code
was not previously tested as the Mozilla CVS -> git-fast-import
program always fed us full SHA1s for modified blobs and did not
use the mark feature there.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The new tree delta implementation caused blob SHA1s to be used
instead of a tree SHA1 when a tree was written out. This really
only appeared to happen when converting an existing file to a tree,
but may have been possible in some other situations.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Since most commits and tag objects are around the same size and we
only generate one at a time we can reuse the same buffer rather than
xmalloc'ing and free'ing the buffer every time we generate a commit.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We only ever generate at most two tree streams at a time. Since most
trees are around the same size we can simply recycle the buffers from
one tree generation to the next rather than constantly xmalloc'ing
and free'ing them. This should perform slightly better when handling
a large number of trees as malloc has less work to do.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
We now store for every tree entry two modes and two sha1 values;
the base (aka "version 0") and the current/new (aka "version 1").
When we generate a tree object we also regenerate the prior version
object and use that as our base object for a delta. This strategy
saves a significant amount of memory as we can continue to use the
atom pool for file/directory names and only increases each tree
entry by an additional 24 bytes of memory.
Branches should automatically delta against their ancestor tree,
unless the ancestor tree is already at the delta chain limit.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Sometimes an import frontend may need to work with a temporary branch
which will actually contain many different branches over the life
of the import. This is especially useful when the frontend needs
to create a tag from a set of file versions which are otherwise
never a commit.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When generating a very large pack file (for example close to 1 GB
in size) it may be impossible for the kernel to find a contiguous
free range within a 32 bit address space for the mapping to be
located at. This is especially problematic on large imports where
there is a lot of malloc activity occuring within the same process
and the malloc'd regions may straddle the previously mapped regions,
thereby creating large holes in the address space.
So instead we map only 128 MB of the pack at any given time.
This will likely increase the number of times the file gets mapped
(with additional system time required to update the page tables
more frequently) but will allow the program to handle packs up to
4 GB in size.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
fast-import was encounting a GPF when it ran out of free tree_entry
objects but didn't know this was the cause because the last
tree_entry wasn't terminated with a NULL pointer. The missing NULL
pointer occurred when we allocated additional entries via xmalloc
but didn't set the last tree_entry's "next" pointer to NULL.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This option can be used to have a record of every commit, the mark
(if supplied) and branch name of the commit recorded into a log file
when the commit is generated. This log can be useful to verify the
results of an import as the commits can be compared to some source
repository matching commits through the mark value.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The marks table can be used by the frontend to load any commit after
the import and compare it to whatever data the frontend knows about
that commit. If the mark idnums can be easily correlated to some
reference source then its relatively trivial to compare the GIT
tree to the reference to verify the accuracy of the import.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
cvs2svn has three phases: begin_commit, middle_commit, end_commit.
The ancester is computed in the middle_commit phase. So its easier
to generate a stream if the from command appears after the commit
message itself but before the file change commands.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Jon Smirl was finding it difficult to alter cvs2svn to generate
branch commands prior to the first commit of the same branch.
This change moves the 'from' command to be an optional parameter of
the 'commit' command, thereby allowing a new branch to be defined
at the moment it gets used to create the first commit on that branch.
This change makes it impossible to create a branch with no commits
on it as at least one commit is needed to register the branch.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Some architectures (e.g. SPARC) would require that we access pointers
only on pointer-sized alignments. So ensure the pool allocator
rounds out non-pointer sized allocations to the next pointer so we
don't generate bad memory addresses. This could have occurred if
we had previously allocated an atom whose string was not a whole
multiple of the pointer size, for example.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Tree reloading allows fast-import to swap out the least-recently used
branch by simply deallocating the data structures from memory that
were associated with that branch. Later if the branch becomes active
again it can lazily recreate those structures on demand by reloading
the necessary trees from the pack file it originally wrote them to.
The reloading process is implemented by mmap'ing the pack into
memory and using a much tighter variant of the pack reading code
contained in sha1_file.c. This was a blatent copy from sha1_file.c
but the unpacking functions were significantly simplified and are
actually now in a form that should make it easier to map only the
necessary regions of a pack rather than the entire file.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Tags received from the frontend are generated in memory in a simple
linked list in the order that the tag commands were sent by the
frontend. If multiple different tag objects for the same tag name
get generated the last one sent by the frontend will be the one
that gets written out at termination. Multiple tag objects for
the same name will cause all older tags of the same name to be lost.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If the branch load count exceeds the number of branches created then
the frontend is causing fast-import to page branches into and out of
memory due to the way its ordering its commits. Performance can
likely be increased if the frontend were to alter its commit
sequence such that it stays on one branch before switching to another
branch, then never returns to the prior branch.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Marks are now saved when the mark directive gets used by the frontend
and may be used in place of a SHA1 expression to locate a previous
SHA1 which fast-import may have generated. This is particularly
useful with commits where the frontend does not (easily) have the
ability to compute the SHA1 for an arbitrary commit but needs it
to generate a branch or tag from that commit.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The following command line options are now accepted before the
pack name:
--objects=n # replaces the object count after the pack name
--depth=n # delta chain depth to use (default is 10)
--active-branches=n # maximum number of branches to keep in memory
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Growing a tree caused all subtrees to be deallocated and put back
into the free list yet those subtree's contents were still actively
in use. Consequently they were doled out again and got stomped
on elsewhere. Releasing a tree is now performed in two parts,
either releasing only the content array or releasing the content
array and recursively releasing the subtree(s).
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
If a frontend is smart enough to import a symlink then we should
let them do so. We'll assume that they were smart enough to first
generate a blob to hold the link target, as that's how symlinks
get represented in GIT.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Frontend clients can now send a text stream to fast-import rather
than a binary stream. This should facilitate developing frontend
software as the data stream is easier to view, manipulate and debug
my hand and Mark-I eyeball.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
When accepting revision SHA1 IDs from the frontend verify the SHA1
actually refers to a blob and is known to exist. Its an error
to use a SHA1 in a tree if the blob doesn't exist as this would
cause git-fsck-objects to report a missing blob should the pack get
closed without the blob being appended into it or a subsequent pack.
So right now we'll just ask that the frontend "pre-declare" any
blobs it wants to use in a tree before it can use them.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
The tree of the current commit can be altered by file_change commands
before the commit gets written to the pack. The file changes are
rather primitive as they simply allow removal of a tree entry or
setting/adding a tree entry.
Currently trees and commits aren't being deltafied when written to
the pack and branch reloading from the current pack doesn't work,
so at most 5 branches can be worked with at any one time.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
This provides the basic data structures needed to store trees in
memory while we are processing them for a branch. What we are
attempting to do is track one complete tree for each branch that
the frontend has registered with us through the 'newb' (new_branch)
command. When the frontend edits that tree through 'updf' or 'delf'
commands we'll mark the affected tree(s) as being dirty and recompute
their objects during 'comt' (commit).
Currently the protocol is decidedly _not_ user friendly. I crashed
fast-import by giving it bad input data from Perl. I may try to
improve upon it, or at least upon its error handling.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Moved the new_blob logic off into a new subroutine and
invoked it when getting the 'blob' command.
Added statistics dump to STDERR when the program terminates listing
what it did at a high level. This is somewhat interesting.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Too many globals variables were being used not not enough
code was resuable to process trees and commits so this is
a simple refactoring of the existing blob processing code
to get into a state that will be easier to handle trees
and commits in.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Although its easy to ask the user to tell us how many objects they
will need, its probably better to dynamically grow the object table
in large units. But if the user can give us a hint as to roughly
how many objects then we can still use it during startup.
Also stopped printing the SHA1 strings to stdout as no user is
currently making use of that facility.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>