Reduce duplicated code between convert.c and attr.c.
* lf/read-blob-data-from-index:
convert.c: remove duplicate code
read_blob_data_from_index(): optionally return the size of blob data
attr.c: extract read_index_data() as read_blob_data_from_index()
This allows for optionally getting the size of the returned data and
will be used in a follow-up patch.
Signed-off-by: Lukas Fleischer <git@cryptocrack.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Extract the read_index_data() function from attr.c and move it to
read-cache.c; rename it to read_blob_data_from_index() and update
the function signature of it to align better with index/cache API
functions.
This allows for reusing the function in convert.c later.
Signed-off-by: Lukas Fleischer <git@cryptocrack.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to keep track of what directory names are known to Git on
platforms with case insensitive filesystems can get confused upon
a hash collision between these pathnames and looped forever.
* kb/name-hash:
name-hash.c: fix endless loop with core.ignorecase=true
Update the index format documentation to mention the v4 format.
* nd/doc-index-format:
update-index: list supported idx versions and their features
read-cache.c: use INDEX_FORMAT_{LB,UB} in verify_hdr()
index-format.txt: mention of v4 is missing in some places
With core.ignorecase=true, name-hash.c builds a case insensitive index of
all tracked directories. Currently, the existing cache entry structures are
added multiple times to the same hashtable (with different name lengths and
hash codes). However, there's only one dir_next pointer, which gets
completely messed up in case of hash collisions. In the worst case, this
causes an endless loop if ce == ce->dir_next (see t7062).
Use a separate hashtable and separate structures for the directory index
so that each directory entry has its own next pointer. Use reference
counting to track which directory entry contains files.
There are only slight changes to the name-hash.c API:
- new free_name_hash() used by read_cache.c::discard_index()
- remove_name_hash() takes an additional index_state parameter
- index_name_exists() for a directory (trailing '/') may return a cache
entry that has been removed (CE_UNHASHED). This is not a problem as the
return value is only used to check if the directory exists (dir.c) or to
normalize casing of directory names (read-cache.c).
Getting rid of cache_entry.dir_next reduces memory consumption, especially
with core.ignorecase=false (which doesn't use that member at all).
With core.ignorecase=true, building the directory index is slightly faster
as we add / check the parent directory first (instead of going through all
directory levels for each file in the index). E.g. with WebKit (~200k
files, ~7k dirs), time spent in lazy_init_name_hash is reduced from 176ms
to 130ms.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
9d22778 (read-cache.c: write prefix-compressed names in the index -
2012-04-04) defined these. Interestingly, they were not used by
read-cache.c, or anywhere in that patch. They were used in
builtin/update-index.c later for checking supported index
versions. Use them here too.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Specifically the fields uid, gid, ctime, ino and dev are set to zero
by JGit. Other implementations, eg. Git in cygwin are allegedly also
somewhat incompatible with Git For Windows and on *nix platforms
the resolution of the timestamps may differ.
Any stat checking by git will then need to check content, which may
be very slow, particularly on Windows. Since mtime and size
is typically enough we should allow the user to tell git to avoid
checking these fields if they are set to zero in the index.
This change introduces a core.checkstat config option where the
the user can select to check all fields (default), or just size
and the whole second part of mtime (minimal).
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We do not want a link to 0{40} object stored anywhere in our objects.
* jk/maint-null-in-trees:
fsck: detect null sha1 in tree entries
do not write null sha1s to on-disk index
diff: do not use null sha1 as a sentinel value
Assignments to errno before calling system functions that used to
matter in the old code were left behind after the code structure
changed sufficiently to make them useless.
* nd/index-errno:
read_index_from: remove bogus errno assignments
These assignments comes from the very first commit e83c516 (Initial
revision of "git", the information manager from hell - 2005-04-07).
Back then we did not die() when errors happened so correct errno was
required.
Since 5d1a5c0 ([PATCH] Better error reporting for "git status" -
2005-10-01), read_index_from() learned to die rather than just return
-1 and these assignments became irrelevant. Remove them.
While at it, move die_errno() next to xmmap() call because it's the
mmap's error code that we care about. Otherwise if close(fd); fails,
it could overwrite mmap's errno.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We should never need to write the null sha1 into an index
entry (short of the 1 in 2^160 chance that somebody actually
has content that hashes to it). If we attempt to do so, it
is much more likely that it is a bug, since we use the null
sha1 as a sentinel value to mean "not valid".
The presence of null sha1s in the index (which can come
from, among other things, "update-index --cacheinfo", or by
reading a corrupted tree) can cause problems for later
readers, because they cannot distinguish the literal null
sha1 from its use a sentinel value. For example, "git
diff-files" on such an entry would make it appear as if it
is stat-dirty, and until recently, the diff code assumed
such an entry meant that we should be diffing a working tree
file rather than a blob.
Ideally, we would stop such entries from entering even our
in-core index. However, we do sometimes legitimately add
entries with null sha1s in order to represent these sentinel
situations; simply forbidding them in add_index_entry breaks
a lot of the existing code. However, we can at least make
sure that our in-core sentinel representation never makes it
to disk.
To be thorough, we will test an attempt to add both a blob
and a submodule entry. In the former case, we might run into
problems anyway because we will be missing the blob object.
But in the latter case, we do not enforce connectivity
across gitlink entries, making this our only point of
enforcement. The current implementation does not care which
type of entry we are seeing, but testing both cases helps
future-proof the test suite in case that changes.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Split lower bits of ce_flags field and creates a new ce_namelen
field in the in-core index structure.
* tg/ce-namelen-field:
Strip namelen out of ce_flags into a ce_namelen field
Even though the index can record pathnames longer than 1<<12 bytes,
in some places we were not comparing them in full, potentially
replacing index entries instead of adding.
* tg/maint-cache-name-compare:
cache_name_compare(): do not truncate while comparing paths
Strip the name length from the ce_flags field and move it
into its own ce_namelen field in struct cache_entry. This
will both give us a tiny bit of a performance enhancement
when working with long pathnames and is a refactoring for
more readability of the code.
It enhances readability, by making it more clear what
is a flag, and where the length is stored and make it clear
which functions use stages in comparisions and which only
use the length.
It also makes CE_NAMEMASK private, so that users don't
mistakenly write the name length in the flags.
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We failed to use ce_namelen() equivalent and instead only compared
up to the CE_NAMEMASK bytes by mistake. Adding an overlong path
that shares the same common prefix as an existing entry in the index
did not add a new entry, but instead replaced the existing one, as
the result.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Replace strlen(ce->name) with ce_namelen() in a couple
of places which gives us some additional bits of
performance.
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Trivially shrinks the on-disk size of the index file to save both I/O and
checksum overhead.
The topic should give a solid base to build on further updates, with the
code refactoring in its earlier parts, and the backward compatibility
mechanism in its later parts.
* jc/index-v4:
index-v4: document the entry format
unpack-trees: preserve the index file version of original
update-index: upgrade/downgrade on-disk index version
read-cache.c: write prefix-compressed names in the index
read-cache.c: read prefix-compressed names in index on-disk version v4
read-cache.c: move code to copy incore to ondisk cache to a helper function
read-cache.c: move code to copy ondisk to incore cache to a helper function
read-cache.c: report the header version we do not understand
read-cache.c: make create_from_disk() report number of bytes it consumed
read-cache.c: allow unaligned mapping of the index file
cache.h: hide on-disk index details
varint: make it available outside the context of pack
Teach the code to write the index in the v4 on-disk format.
Record the format version of the on-disk index we read from in the
index_state, and use the format when writing the new index out.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Because the entries are sorted by path, adjacent entries in the index tend
to share the leading components of them, and it makes sense to only store
the differences in later entries. In the v4 on-disk format of the index,
each on-disk cache entry stores the number of bytes to be stripped from
the end of the previous name, and the bytes to append to the result, to
come up with its name.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function is the one that is reading from the data stream. It only is
natural to make it responsible for reporting this number, not the caller.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both the on-disk format v2 and v3 pads the "name" field to the multiple of
eight to make sure that various quantities in network long/short type can
be accessed with ntohl/ntohs without having to worry about alignment, but
this forces us to waste disk I/O bandwidth.
Introduce ntoh_s()/ntoh_l() macros that the callers can use as if they were
the regular ntohs()/ntohl() on a field that may not be aligned correctly.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The on-disk format of the index file is a detail whose implementation is
neatly encapsulated in read-cache.c; there is no need to expose it to the
general public that include the cache.h header file.
Also add a prominent mark to read-cache.c to delineate the parts that deal
with the index file I/O routines from the remainder of the file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The read-cache implementation defines this static function,
but it is a generally useful concept in git. Let's give
the empty blob the same treatment as the empty tree,
providing both hex and binary forms of the sha1.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When running "git add --refresh <pathspec>", we incorrectly showed the
path that is unmerged even if it is outside the specified pathspec, even
though we did honor pathspec and refreshed only the paths that matched.
Note that this cange does not affect "git update-index --refresh"; for
hysterical raisins, it does not take a pathspec (it takes real paths) and
more importantly itss command line options are parsed and executed one by
one as they are encountered, so "git update-index --refresh foo" means
"first refresh the index, and then update the entry 'foo' by hashing the
contents in file 'foo'", not "refresh only entry 'foo'".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* rs/allocate-cache-entry-individually:
cache.h: put single NUL at end of struct cache_entry
read-cache.c: allocate index entries individually
Conflicts:
read-cache.c
If you have a deleted file and a porcelain refreshes the
cache, we print:
Unstaged changes after reset:
M file
This is technically correct, in that the file is modified,
but it's friendlier to the user if we further differentiate
the case of a deleted file (especially because this output
looks a lot like "diff --name-status", which would also make
the distinction).
Similarly, we can distinguish typechanges ("T") and
intent-to-add files ("A"), both of which appear as just "M"
in the current output.
The plumbing output for all cases remains "needs update" for
historical compatibility.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When refreshing the index, for modified (or unmerged) files we will print
"needs update" (or "needs merge") for plumbing, or line similar to the
output from "diff --name-status" for porcelain.
The variables holding which type of message to show are named after the
plumbing messages. However, as we begin to differentiate more cases at the
porcelain level (with the plumbing message staying the same), that naming
scheme will become awkward.
Instead, name the variables after which case we found (modified or
unmerged), not what we will output.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This will enable refresh_cache to differentiate more cases
of modification (such as typechange) when telling the user
what isn't fresh.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to estimate the in-memory size of the index based on its on-disk
representation is subtly wrong for certain architecture-dependent struct
layouts. Instead of fixing it, replace the code to keep the index entries
in a single large block of memory and allocate each entry separately
instead. This is both simpler and more flexible, as individual entries
can now be freed. Actually using that added flexibility is left for a
later patch.
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
estimate_cache_size() tries to guess how much memory is needed for the
in-memory representation of an index file. It does that by using the
file size, the number of entries and the difference of the sizes of the
on-disk and in-memory structs -- without having to check the length of
the name of each entry, which varies for each entry, but their sums are
the same no matter the representation.
Except there can be a difference. First of all, the size is really
calculated by ce_size and ondisk_ce_size based on offsetof(..., name),
not sizeof, which can be different. And entries are padded with 1 to 8
NULs at the end (after the variable name) to make their total length a
multiple of eight.
So in order to allocate enough memory to hold the index, change the
delta calculation to be based on offsetof(..., name) and round up to
the next multiple of eight.
On a 32-bit Linux, this delta was used before:
sizeof(struct cache_entry) == 72
sizeof(struct ondisk_cache_entry) == 64
---
8
The actual difference for an entry with a filename length of one was,
however (find the definitions are in cache.h):
offsetof(struct cache_entry, name) == 72
offsetof(struct ondisk_cache_entry, name) == 62
ce_size == (72 + 1 + 8) & ~7 == 80
ondisk_ce_size == (62 + 1 + 8) & ~7 == 64
---
16
So eight bytes less had been allocated for such entries. The new
formula yields the correct delta:
(72 - 62 + 7) & ~7 == 16
Reported-by: John Hsing <tsyj2007@gmail.com>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ef/maint-win-verify-path:
verify_dotfile(): do not assume '/' is the path seperator
verify_path(): simplify check at the directory boundary
verify_path: consider dos drive prefix
real_path: do not assume '/' is the path seperator
A Windows path starting with a backslash is absolute
verify_dotfile() currently assumes that the path seperator is '/', but on
Windows it can also be '\\', so use is_dir_sep() instead.
Signed-off-by: Theo Niessink <theo@taletn.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We simply want to say "At a directory boundary, be careful with a name
that begins with a dot, forbid a name that ends with the boundary
character or has duplicated bounadry characters".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If someone manage to create a repo with a 'C:' entry in the
root-tree, files can be written outside of the working-dir. This
opens up a can-of-worms of exploits.
Fix it by explicitly checking for a dos drive prefix when verifying
a paht. While we're at it, make sure that paths beginning with '\' is
considered absolute as well.
Noticed-by: Theo Niessink <theo@taletn.com>
Signed-off-by: Erik Faye-Lund <kusmabite@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "format_check" parameter tucked after the existing parameters is too
ugly an afterthought to live in any reasonable API.
Combine it with the other boolean parameter "write_object" into a single
"flags" parameter.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Traditional "opportunistic index update" done by read-only "diff" and
"status" was about updating cached lstat(2) information in the index for
the next round. We missed another obvious optimization opportunity: when
there are racily clean entries that will cease to be racily clean by
updating $GIT_INDEX_FILE. Detect that case and write $GIT_INDEX_FILE out
to give it a newer timestamp.
Noticed by Lasse Makholm by stracing "git status" in a fresh checkout and
counting the number of open(2) calls.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we had to refresh the index internally before running diff or status,
we opportunistically updated the $GIT_INDEX_FILE so that later invocation
of git can use the lstat(2) we already did in this invocation.
Make them share a helper function to do so.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* nd/struct-pathspec: (22 commits)
t6004: add pathspec globbing test for log family
t7810: overlapping pathspecs and depth limit
grep: drop pathspec_matches() in favor of tree_entry_interesting()
grep: use writable strbuf from caller for grep_tree()
grep: use match_pathspec_depth() for cache/worktree grepping
grep: convert to use struct pathspec
Convert ce_path_match() to use match_pathspec_depth()
Convert ce_path_match() to use struct pathspec
struct rev_info: convert prune_data to struct pathspec
pathspec: add match_pathspec_depth()
tree_entry_interesting(): optimize wildcard matching when base is matched
tree_entry_interesting(): support wildcard matching
tree_entry_interesting(): fix depth limit with overlapping pathspecs
tree_entry_interesting(): support depth limit
tree_entry_interesting(): refactor into separate smaller functions
diff-tree: convert base+baselen to writable strbuf
glossary: define pathspec
Move tree_entry_interesting() to tree-walk.c and export it
tree_entry_interesting(): remove dependency on struct diff_options
Convert struct diff_options to use struct pathspec
...