git-commit-vandalism

Author	SHA1	Message	Date
Junio C Hamano	7f109ef54e	Merge branch 'ks/pack-objects-bitmap' Some codepaths in "git pack-objects" were not ready to use an existing pack bitmap; now they are and as the result they have become faster. * ks/pack-objects-bitmap: pack-objects: use reachability bitmap index when generating non-stdout pack pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use	2016-09-21 15:15:21 -07:00
Junio C Hamano	9883ec2c73	Merge branch 'jk/pack-tag-of-tag' "git pack-objects --include-tag" was taught that when we know that we are sending an object C, we want a tag B that directly points at C but also a tag A that points at the tag B. We used to miss the intermediate tag B in some cases. * jk/pack-tag-of-tag: pack-objects: walk tag chains for --include-tag t5305: simplify packname handling t5305: use "git -C" t5305: drop "dry-run" of unpack-objects t5305: move cleanup into test block	2016-09-15 14:11:14 -07:00
Kirill Smelkov	645c432d61	pack-objects: use reachability bitmap index when generating non-stdout pack Starting from `6b8fda2d` (pack-objects: use bitmaps when packing objects) if a repository has bitmap index, pack-objects can nicely speedup "Counting objects" graph traversal phase. That however was done only for case when resultant pack is sent to stdout, not written into a file. The reason here is for on-disk repack by default we want: - to produce good pack (with bitmap index not-yet-packed objects are emitted to pack in suboptimal order). - to use more robust pack-generation codepath (avoiding possible bugs in bitmap code and possible bitmap index corruption). Jeff King further explains: The reason for this split is that pack-objects tries to determine how "careful" it should be based on whether we are packing to disk or to stdout. Packing to disk implies "git repack", and that we will likely delete the old packs after finishing. We want to be more careful (so as not to carry forward a corruption, and to generate a more optimal pack), and we presumably run less frequently and can afford extra CPU. Whereas packing to stdout implies serving a remote via "git fetch" or "git push". This happens more frequently (e.g., a server handling many fetching clients), and we assume the receiving end takes more responsibility for verifying the data. But this isn't always the case. One might want to generate on-disk packfiles for a specialized object transfer. Just using "--stdout" and writing to a file is not optimal, as it will not generate the matching pack index. So it would be useful to have some way of overriding this heuristic: to tell pack-objects that even though it should generate on-disk files, it is still OK to use the reachability bitmaps to do the traversal. So we can teach pack-objects to use bitmap index for initial object counting phase when generating resultant pack file too: - if we take care to not let it be activated under git-repack: See above about repack robustness and not forward-carrying corruption. - if we know bitmap index generation is not enabled for resultant pack: The current code has singleton bitmap_git, so it cannot work simultaneously with two bitmap indices. We also want to avoid (at least with current implementation) generating bitmaps off of bitmaps. The reason here is: when generating a pack, not-yet-packed objects will be emitted into pack in suboptimal order and added to tail of the bitmap as "extended entries". When the resultant pack + some new objects in associated repository are in turn used to generate another pack with bitmap, the situation repeats: new objects are again not emitted optimally and just added to bitmap tail - not in recency order. So the pack badness can grow over time when at each step we have bitmapped pack + some other objects. That's why we want to avoid generating bitmaps off of bitmaps, not to let pack badness grow. - if we keep pack reuse enabled still only for "send-to-stdout" case: Because pack-to-file needs to generate index for destination pack, and currently on pack reuse raw entries are directly written out to the destination pack by write_reused_pack(), bypassing needed for pack index generation bookkeeping done by regular codepath in write_one() and friends. ( In the future we might teach pack-reuse code about cases when index also needs to be generated for resultant pack and remove pack-reuse-only-for-stdout limitation ) This way for pack-objects -> file we get nice speedup: erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup repository managed by git-backup[2] via time echo 0186ac99 \| git pack-objects --revs erp5pack before: 37.2s after: 26.2s And for `git repack -adb` packed git.git time echo `5c589a73` \| git pack-objects --revs gitpack before: 7.1s after: 3.6s i.e. it can be 30% - 50% speedup for pack extraction. git-backup extracts many packs on repositories restoration. That was my initial motivation for the patch. [1] https://lab.nexedi.com/nexedi/erp5 [2] https://lab.nexedi.com/kirr/git-backup NOTE Jeff also suggests that pack.useBitmaps was probably a mistake to introduce originally. This way we are not adding another config point, but instead just always default to-file pack-objects not to use bitmap index: Tools which need to generate on-disk packs with using bitmap, can pass --use-bitmap-index explicitly. And git-repack does never pass --use-bitmap-index, so this way we can be sure regular on-disk repacking remains robust. NOTE2 `git pack-objects --stdout >file.pack` + `git index-pack file.pack` is much slower than `git pack-objects file.pack`. Extracting erp5.git pack from lab.nexedi.com backup repository: $ time echo 0186ac99 \| git pack-objects --stdout --revs >erp5pack-stdout.pack real 0m22.309s user 0m21.148s sys 0m0.932s $ time git index-pack erp5pack-stdout.pack real 0m50.873s <-- more than 2 times slower than time to generate pack itself! user 0m49.300s sys 0m1.360s So the time for `pack-object --stdout >file.pack` + `index-pack file.pack` is 72s, while `pack-objects file.pack` which does both pack and index is 27s. And even `pack-objects --no-use-bitmap-index file.pack` is 37s. Jeff explains: The packfile does not carry the sha1 of the objects. A receiving index-pack has to compute them itself, including inflating and applying all of the deltas. that's why for `git-backup restore` we want to teach `git pack-objects file.pack` to use bitmaps instead of using `git pack-objects --stdout >file.pack` + `git index-pack file.pack`. NOTE3 The speedup is now tracked via t/perf/p5310-pack-bitmaps.sh Test `56dfeb62` this tree -------------------------------------------------------------------------------- 5310.2: repack to disk 8.98(8.05+0.29) 9.05(8.08+0.33) +0.8% 5310.3: simulated clone 2.02(2.27+0.09) 2.01(2.25+0.08) -0.5% 5310.4: simulated fetch 0.81(1.07+0.02) 0.81(1.05+0.04) +0.0% 5310.5: pack to file 7.58(7.04+0.28) 7.60(7.04+0.30) +0.3% 5310.6: pack to file (bitmap) 7.55(7.02+0.28) 3.25(2.82+0.18) -57.0% 5310.8: clone (partial bitmap) 1.83(2.26+0.12) 1.82(2.22+0.14) -0.5% 5310.9: pack to file (partial bitmap) 6.86(6.58+0.30) 2.87(2.74+0.20) -58.2% More context: http://marc.info/?t=146792101400001&r=1&w=2 http://public-inbox.org/git/20160707190917.20011-1-kirr@nexedi.com/T/#t Cc: Vicent Marti <tanoku@gmail.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-12 13:47:41 -07:00
Kirill Smelkov	702d1b9583	pack-objects: respect --local/--honor-pack-keep/--incremental when bitmap is in use Since `6b8fda2d` (pack-objects: use bitmaps when packing objects) there are two codepaths in pack-objects: with & without using bitmap reachability index. However add_object_entry_from_bitmap(), despite its non-bitmapped counterpart add_object_entry(), in no way does check for whether --local or --honor-pack-keep or --incremental should be respected. In non-bitmapped codepath this is handled in want_object_in_pack(), but bitmapped codepath has simply no such checking at all. The bitmapped codepath however was allowing to pass in all those options and with bitmap indices still being used under such conditions - potentially giving wrong output (e.g. including objects from non-local or .keep'ed pack). We can easily fix this by noting the following: when an object comes to add_object_entry_from_bitmap() it can come for two reasons: 1. entries coming from main pack covered by bitmap index, and 2. object coming from, possibly alternate, loose or other packs. "2" can be already handled by want_object_in_pack() and to cover "1" we can teach want_object_in_pack() to expect that *found_pack can be non-NULL, meaning calling client already found object's pack entry. In want_object_in_pack() we care to start the checks from already found pack, if we have one, this way determining the answer right away in case neither --local nor --honour-pack-keep are active. In particular, as p5310-pack-bitmaps.sh shows (3 consecutive runs), we do not do harm to served-with-bitmap clones performance-wise: Test `56dfeb62` this tree ----------------------------------------------------------------- 5310.2: repack to disk 9.08(8.20+0.25) 9.09(8.14+0.32) +0.1% 5310.3: simulated clone 1.92(2.12+0.08) 1.93(2.12+0.09) +0.5% 5310.4: simulated fetch 0.82(1.07+0.04) 0.82(1.06+0.04) +0.0% 5310.6: partial bitmap 1.96(2.42+0.13) 1.95(2.40+0.15) -0.5% Test `56dfeb62` this tree ----------------------------------------------------------------- 5310.2: repack to disk 9.11(8.16+0.32) 9.11(8.19+0.28) +0.0% 5310.3: simulated clone 1.93(2.14+0.07) 1.92(2.11+0.10) -0.5% 5310.4: simulated fetch 0.82(1.06+0.04) 0.82(1.04+0.05) +0.0% 5310.6: partial bitmap 1.95(2.38+0.16) 1.94(2.39+0.14) -0.5% Test `56dfeb62` this tree ----------------------------------------------------------------- 5310.2: repack to disk 9.13(8.17+0.31) 9.07(8.13+0.28) -0.7% 5310.3: simulated clone 1.92(2.13+0.07) 1.91(2.12+0.06) -0.5% 5310.4: simulated fetch 0.82(1.08+0.03) 0.82(1.08+0.03) +0.0% 5310.6: partial bitmap 1.96(2.43+0.14) 1.96(2.42+0.14) +0.0% with delta timings showing they are all within noise from run to run. In the general case we do not want to call find_pack_entry_one() more than once, because it is expensive. This patch splits the loop in want_object_in_pack() into two parts: finding the object and seeing if it impacts our choice to include it in the pack. We may call the inexpensive want_found_object() twice, but we will never call find_pack_entry_one() if we do not need to. I appreciate help and discussing this change with Junio C Hamano and Jeff King. Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-12 13:47:41 -07:00
Jeff King	b773ddea2c	pack-objects: walk tag chains for --include-tag When pack-objects is given --include-tag, it peels each tag ref down to a non-tag object, and if that non-tag object is going to be packed, we include the tag, too. But what happens if we have a chain of tags (e.g., tag "A" points to tag "B", which points to commit "C")? We'll peel down to "C" and realize that we want to include tag "A", but we do not ever consider tag "B", leading to a broken pack (assuming "B" was not otherwise selected). Instead, we have to walk the whole chain, adding any tags we find to the pack. Interestingly, it doesn't seem possible to trigger this problem with "git fetch", but you can with "git clone --single-branch". The reason is that we generate the correct pack when the client explicitly asks for "A" (because we do a real reachability analysis there), and "fetch" is more willing to do so. There are basically two cases: 1. If "C" is already a ref tip, then the client can deduce that it needs "A" itself (via find_non_local_tags), and will ask for it explicitly rather than relying on the include-tag capability. Everything works. 2. If "C" is not already a ref tip, then we hope for include-tag to send us the correct tag. But it doesn't; it generates a broken pack. However, the next step is to do a follow-up run of find_non_local_tags(), followed by fetch_refs() to backfill any tags we learned about. In the normal case, fetch_refs() calls quickfetch(), which does a connectivity check and sees we have no new objects to fetch. We just write the refs. But for the broken-pack case, the connectivity check fails, and quickfetch will follow-up with the remote, asking explicitly for each of the ref tips. This picks up the missing object in a new pack. For a regular "git clone", we are similarly OK, because we explicitly request all of the tag refs, and get a correct pack. But with "--single-branch", we kick in tag auto-following via "include-tag", but do _not_ do a follow-up backfill. We just take whatever the server sent us via include-tag and write out tag refs for any tag objects we were sent. So prior to `c6807a4` (clone: open a shortcut for connectivity check, 2013-05-26), we actually claimed the clone was a success, but the result was silently corrupted! Since `c6807a4`, index-pack's connectivity check catches this case, and we correctly complain. The included test directly checks that pack-objects does not generate a broken pack, but also confirms that "clone --single-branch" does not hit the bug. Note that tag chains introduce another interesting question: if we are packing the tag "B" but not the commit "C", should "A" be included? Both before and after this patch, we do not include "A", because the initial peel_ref() check only knows about the bottom-most level, "C". To realize that "B" is involved at all, we would have to switch to an incremental peel, in which we examine each tagged object, asking if it is being packed (and including the outer tag if so). But that runs contrary to the optimizations in peel_ref(), which avoid accessing the objects at all, in favor of using the value we pull from packed-refs. It's OK to walk the whole chain once we know we're going to include the tag (we have to access it anyway, so the effort is proportional to the pack we're generating). But for the initial selection, we have to look at every ref. If we're only packing a few objects, we'd still have to parse every single referenced tag object just to confirm that it isn't part of a tag chain. This could be addressed if packed-refs stored the complete tag chain for each peeled ref (in most cases, this would be the same cost as now, as each "chain" is only a single link). But given the size of that project, it's out of scope for this fix (and probably nobody cares enough anyway, as it's such an obscure situation). This commit limits itself to just avoiding the creation of a broken pack. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-07 11:45:31 -07:00
Jeff King	c9af708b1a	pack-objects: use mru list when iterating over packs In the original implementation of want_object_in_pack(), we always looked for the object in every pack, so the order did not matter for performance. As of the last few patches, however, we can now often break out of the loop early after finding the first instance, and avoid looking in the other packs at all. In this case, pack order can make a big difference, because we'd like to find the objects by looking at as few packs as possible. This patch switches us to the same packed_git_mru list that is now used by normal object lookups. Here are timings for p5303 on linux.git: Test HEAD^ HEAD ------------------------------------------------------------------------ 5303.3: rev-list (1) 31.31(31.07+0.23) 31.28(31.00+0.27) -0.1% 5303.4: repack (1) 40.35(38.84+2.60) 40.53(39.31+2.32) +0.4% 5303.6: rev-list (50) 31.37(31.15+0.21) 31.41(31.16+0.24) +0.1% 5303.7: repack (50) 58.25(68.54+2.03) 47.28(57.66+1.89) -18.8% 5303.9: rev-list (1000) 31.91(31.57+0.33) 31.93(31.64+0.28) +0.1% 5303.10: repack (1000) 304.80(376.00+3.92) 87.21(159.54+2.84) -71.4% The rev-list numbers are unchanged, which makes sense (they are not exercising this code at all). The 50- and 1000-pack repack cases show considerable improvement. The single-pack repack case doesn't, of course; there's nothing to improve. In fact, it gives us a baseline for how fast we could possibly go. You can see that though rev-list can approach the single-pack case even with 1000 packs, repack doesn't. The reason is simple: the loop we are optimizing is only part of what the repack is doing. After the "counting" phase, we do delta compression, which is much more expensive when there are multiple packs, because we have fewer deltas we can reuse (you can also see that these numbers come from a multicore machine; the CPU times are much higher than the wall-clock times due to the delta phase). So the good news is that in cases with many packs, we used to be dominated by the "counting" phase, and now we are dominated by the delta compression (which is faster, and which we have already parallelized). Here are similar numbers for git.git: Test HEAD^ HEAD --------------------------------------------------------------------- 5303.3: rev-list (1) 1.55(1.51+0.02) 1.54(1.53+0.00) -0.6% 5303.4: repack (1) 1.82(1.80+0.08) 1.82(1.78+0.09) +0.0% 5303.6: rev-list (50) 1.58(1.57+0.00) 1.58(1.56+0.01) +0.0% 5303.7: repack (50) 2.50(3.12+0.07) 2.31(2.95+0.06) -7.6% 5303.9: rev-list (1000) 2.22(2.20+0.02) 2.23(2.19+0.03) +0.5% 5303.10: repack (1000) 10.47(16.78+0.22) 7.50(13.76+0.22) -28.4% Not as impressive in terms of percentage, but still measurable wins. If you look at the wall-clock time improvements in the 1000-pack case, you can see that linux improved by roughly 10x as many seconds as git. That's because it has roughly 10x as many objects, and we'd expect this improvement to scale linearly with the number of objects (since the number of packs is kept constant). It's just that the "counting" phase is a smaller percentage of the total time spent for a git.git repack, and hence the percentage win is smaller. The implementation itself is a straightforward use of the MRU code. We only bother marking a pack as used when we know that we are able to break early out of the loop, for two reasons: 1. If we can't break out early, it does no good; we have to visit each pack anyway, so we might as well avoid even the minor overhead of managing the cache order. 2. The mru_mark() function reorders the list, which would screw up our traversal. So it is only safe to mark when we are about to break out of the loop. We could record the found pack and mark it after the loop finishes, of course, but that's more complicated and it doesn't buy us anything due to (1). Note that this reordering does have a potential impact on the final pack, as we store only a single "found" pack for each object, even if it is present in multiple packs. In principle, any copy is acceptable, as they all refer to the same content. But in practice, they may differ in whether they are stored as deltas, against which base, etc. This may have an impact on delta reuse, and even the delta search (since we skip pairs that were already in the same pack). It's not clear whether this change of order would hurt or even help average cases, though. The most likely reason to have duplicate objects is from the completion of thin packs (e.g., you have some objects in a "base" pack, then receive several pushes; the packs you receive may be thin on the wire, with deltas that refer to bases outside the pack, but we complete them with duplicate base objects when indexing them). In such a case the current code would always find the thin duplicates (because we currently walk the packs in reverse chronological order). Whereas with this patch, some of those duplicates would be found in the base pack instead. In my tests repacking a real-world case of linux.git with 3600 thin-pack pushes (on top of a large "base" pack), the resulting pack was about 0.04% larger with this patch. On the other hand, because we were more likely to hit the base pack, there were more opportunities for delta reuse, and we had 50,000 fewer objects to examine in the delta search. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-11 10:44:23 -07:00
Jeff King	4cf2143e02	pack-objects: break delta cycles before delta-search phase We do not allow cycles in the delta graph of a pack (i.e., A is a delta of B which is a delta of A) for the obvious reason that you cannot actually access any of the objects in such a case. There's a last-ditch attempt to notice cycles during the write phase, during which we issue a warning to the user and write one of the objects out in full. However, this is "last-ditch" for two reasons: 1. By this time, it's too late to find another delta for the object, so the resulting pack is larger than it otherwise could be. 2. The warning is there because this is something that _shouldn't_ ever happen. If it does, then either: a. a pack we are reusing deltas from had its own cycle b. we are reusing deltas from multiple packs, and we found a cycle among them (i.e., A is a delta of B in one pack, but B is a delta of A in another, and we choose to use both deltas). c. there is a bug in the delta-search code So this code serves as a final check that none of these things has happened, warns the user, and prevents us from writing a bogus pack. Right now, (2b) should never happen because of the static ordering of packs in want_object_in_pack(). If two objects have a delta relationship, then they must be in the same pack, and therefore we will find them from that same pack. However, a future patch would like to change that static ordering, which will make (2b) a common occurrence. In preparation, we should be able to handle those kinds of cycles better. This patch does by introducing a cycle-breaking step during the get_object_details() phase, when we are deciding which deltas can be reused. That gives us the chance to feed the objects into the delta search as if the cycle did not exist. We'll leave the detection and warning in the write_object() phase in place, as it still serves as a check for case (2c). This does mean we will stop warning for (2a). That case is caused by bogus input packs, and we ideally would warn the user about it. However, since those cycles show up after picking reusable deltas, they look the same as (2b) to us; our new code will break the cycles early and the last-ditch check will never see them. We could do analysis on any cycles that we find to distinguish the two cases (i.e., it is a bogus pack if and only if every delta in the cycle is in the same pack), but we don't need to. If there is a cycle inside a pack, we'll run into problems not only reusing the delta, but accessing the object data at all. So when we try to dig up the actual size of the object, we'll hit that same cycle and kick in our usual complain-and-try-another-source code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-08-11 10:44:13 -07:00
Junio C Hamano	78849622ec	Merge branch 'jk/pack-objects-optim' "git pack-objects" has a few options that tell it not to pack objects found in certain packfiles, which require it to scan .idx files of all available packs. The codepaths involved in these operations have been optimized for a common case of not having any non-local pack and/or any .kept pack. * jk/pack-objects-optim: pack-objects: compute local/ignore_pack_keep early pack-objects: break out of want_object loop early find_pack_entry: replace last_found_pack with MRU cache add generic most-recently-used list sha1_file: drop free_pack_by_name t/perf: add tests for many-pack scenarios	2016-08-08 14:48:39 -07:00
Junio C Hamano	aa9136a87e	Merge branch 'nd/pack-ofs-4gb-limit' into maint "git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation	2016-08-08 14:21:36 -07:00
Jeff King	56dfeb6263	pack-objects: compute local/ignore_pack_keep early In want_object_in_pack(), we can exit early from our loop if neither "local" nor "ignore_pack_keep" are set. If they are, however, we must examine each pack to see if it has the object and is non-local or has a ".keep". It's quite common for there to be no non-local or .keep packs at all, in which case we know ahead of time that looking further will be pointless. We can pre-compute this by simply iterating over the list of packs ahead of time, and dropping the flags if there are no packs that could match. Another similar strategy would be to modify the loop in want_object_in_pack() to notice that we have already found the object once, and that we are looping only to check for "local" and "keep" attributes. If a pack has neither of those, we can skip the call to find_pack_entry_one(), which is the expensive part of the loop. This has two advantages: - it isn't all-or-nothing; we still get some improvement when there's a small number of kept or non-local packs, and a large number of non-kept local packs - it eliminates any possible race where we add new non-local or kept packs after our initial scan. In practice, I don't think this race matters; we already cache the packed_git information, so somebody who adds a new pack or .keep file after we've started will not be noticed at all, unless we happen to need to call reprepare_packed_git() because a lookup fails. In other words, we're already racy, and the race is not a big deal (losing the race means we might include an object in the pack that would not otherwise be, which is an acceptable outcome). However, it also has a disadvantage: we still loop over the rest of the packs for each object to check their flags. This is much less expensive than doing the object lookup, but still not free. So if we wanted to implement that strategy to cover the non-all-or-nothing cases, we could do so in addition to this one (so you get the most speedup in the all-or-nothing case, and the best we can do in the other cases). But given that the all-or-nothing case is likely the most common, it is probably not worth the trouble, and we can revisit this later if evidence points otherwise. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-29 11:05:08 -07:00
Jeff King	cd37996795	pack-objects: break out of want_object loop early When pack-objects collects the list of objects to pack (either from stdin, or via its internal rev-list), it filters each one through want_object_in_pack(). This function loops through each existing packfile, looking for the object. When we find it, we mark the pack/offset combo for later use. However, we can't just return "yes, we want it" at that point. If --honor-pack-keep is in effect, we must keep looking to find it in _all_ packs, to make sure none of them has a .keep. Likewise, if --local is in effect, we must make sure it is not present in any non-local pack. As a result, the sum effort of these calls is effectively O(nr_objects * nr_packs). In an ordinary repository, we have only a handful of packs, and this doesn't make a big difference. But in pathological cases, it can slow the counting phase to a crawl. This patch notices the case that we have neither "--local" nor "--honor-pack-keep" in effect and breaks out of the loop early, after finding the first instance. Note that our worst case is still "objects * packs" (i.e., we might find each object in the last pack we look in), but in practice we will often break out early. On an "average" repo, my git.git with 8 packs, this shows a modest 2% (a few dozen milliseconds) improvement in the counting-objects phase of "git pack-objects --all <foo" (hackily instrumented by sticking exit(0) right after list_objects). But in a much more pathological case, it makes a bigger difference. I ran the same command on a real-world example with ~9 million objects across 1300 packs. The counting time dropped from 413s to 45s, an improvement of about 89%. Note that this patch won't do anything by itself for a normal "git gc", as it uses both --honor-pack-keep and --local. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-29 11:05:07 -07:00
Junio C Hamano	ad2d777604	Merge branch 'nd/pack-ofs-4gb-limit' "git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation	2016-07-28 10:34:42 -07:00
Nguyễn Thái Ngọc Duy	af92a645d3	pack-objects: do not truncate result in-pack object size on 32-bit systems A typical diff will not show what's going on and you need to see full functions. The core code is like this, at the end of of write_one() e->idx.offset = offset; size = write_object(f, e, offset); if (!size) { e->idx.offset = recursing; return WRITE_ONE_BREAK; } written_list[nr_written++] = &e->idx; /* make sure off_t is sufficiently large not to wrap / if (signed_add_overflows(offset, size)) die("pack too large for current definition of off_t"); *offset += size; Here we can see that the in-pack object size is returned by write_object (or indirectly by write_reuse_object). And it's used to calculate object offsets, which end up in the pack index file, generated at the end. If "size" overflows (on 32-bit sytems, unsigned long is 32-bit while off_t can be 64-bit), we got wrong offsets and produce incorrect .idx file, which may make it look like the .pack file is corrupted. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-13 09:15:17 -07:00
Nguyễn Thái Ngọc Duy	211c61c6cf	pack-objects: pass length to check_pack_crc() without truncation On 32 bit systems with large file support, unsigned long is 32-bit while the two offsets in the subtraction expression (pack-objects has the exact same expression as in sha1_file.c but not shown in diff) are in 64-bit. If an in-pack object is larger than 2^32 len/datalen is truncated and we get a misleading "error: bad packed object CRC for ..." as a result. Use off_t for len and datalen. check_pack_crc() already accepts this argument as off_t and can deal with 4+ GB. Noticed-by: Christoph Michelbach <michelbach94@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-07-12 10:14:29 -07:00
Jeff King	e26a8c4721	repack: extend --keep-unreachable to loose objects If you use "repack -adk" currently, we will pack all objects that are already packed into the new pack, and then drop the old packs. However, loose unreachable objects will be left as-is. In theory these are meant to expire eventually with "git prune". But if you are using "repack -k", you probably want to keep things forever and therefore do not run "git prune" at all. Meaning those loose objects may build up over time and end up fooling any object-count heuristics (such as the one done by "gc --auto", though since git-gc does not support "repack -k", this really applies to whatever custom scripts people might have driving "repack -k"). With this patch, we instead stuff any loose unreachable objects into the pack along with the already-packed unreachable objects. This may seem wasteful, but it is really no more so than using "repack -k" in the first place. We are at a slight disadvantage, in that we have no useful ordering for the result, or names to hand to the delta code. However, this is again no worse than what "repack -k" is already doing for the packed objects. The packing of these objects doesn't matter much because they should not be accessed frequently (unless they actually _do_ become referenced, but then they would get moved to a different part of the packfile during the next repack). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-06-14 13:57:45 -07:00
Junio C Hamano	40cfc95856	Merge branch 'nd/error-errno' The code for warning_errno/die_errno has been refactored and a new error_errno() reporting helper is introduced. * nd/error-errno: (41 commits) wrapper.c: use warning_errno() vcs-svn: use error_errno() upload-pack.c: use error_errno() unpack-trees.c: use error_errno() transport-helper.c: use error_errno() sha1_file.c: use {error,die,warning}_errno() server-info.c: use error_errno() sequencer.c: use error_errno() run-command.c: use error_errno() rerere.c: use error_errno() and warning_errno() reachable.c: use error_errno() mailmap.c: use error_errno() ident.c: use warning_errno() http.c: use error_errno() and warning_errno() grep.c: use error_errno() gpg-interface.c: use error_errno() fast-import.c: use error_errno() entry.c: use error_errno() editor.c: use error_errno() diff-no-index.c: use error_errno() ...	2016-05-17 14:38:28 -07:00
Junio C Hamano	54c2af5aa3	Merge branch 'ew/doc-split-pack-disables-bitmap' Doc update. * ew/doc-split-pack-disables-bitmap: pack-objects: warn on split packs disabling bitmaps	2016-05-10 13:40:28 -07:00
Nguyễn Thái Ngọc Duy	54d47394b4	builtin/pack-objects.c: use die_errno() and warning_errno() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-05-09 12:29:08 -07:00
Eric Wong	9cea46cdda	pack-objects: warn on split packs disabling bitmaps It can be tempting for a server admin to want a stable set of long-lived packs for dumb clients; but also want to enable bitmaps to serve smart clients more quickly. Unfortunately, such a configuration is impossible; so at least warn users of this incompatibility since commit `21134714` (pack-objects: turn off bitmaps when we split packs, 2014-10-16). Tested the warning by inspecting the output of: make -C t t5310-pack-bitmaps.sh GIT_TEST_OPTS=-v Signed-off-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-04-28 09:58:14 -07:00
brian m. carlson	7d924c9139	struct name_entry: use struct object_id instead of unsigned char sha1[20] Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-04-25 14:23:42 -07:00
Junio C Hamano	11529ecec9	Merge branch 'jk/tighten-alloc' Update various codepaths to avoid manually-counted malloc(). * jk/tighten-alloc: (22 commits) ewah: convert to REALLOC_ARRAY, etc convert ewah/bitmap code to use xmalloc diff_populate_gitlink: use a strbuf transport_anonymize_url: use xstrfmt git-compat-util: drop mempcpy compat code sequencer: simplify memory allocation of get_message test-path-utils: fix normalize_path_copy output buffer size fetch-pack: simplify add_sought_entry fast-import: simplify allocation in start_packfile write_untracked_extension: use FLEX_ALLOC helper prepare_{git,shell}_cmd: use argv_array use st_add and st_mult for allocation size computation convert trivial cases to FLEX_ARRAY macros use xmallocz to avoid size arithmetic convert trivial cases to ALLOC_ARRAY convert manual allocations to argv_array argv-array: add detach function add helpers for allocating flex-array structs harden REALLOC_ARRAY and xcalloc against size_t overflow tree-diff: catch integer overflow in combine_diff_path allocation ...	2016-02-26 13:37:16 -08:00
Jeff King	b32fa95fd8	convert trivial cases to ALLOC_ARRAY Each of these cases can be converted to use ALLOC_ARRAY or REALLOC_ARRAY, which has two advantages: 1. It automatically checks the array-size multiplication for overflow. 2. It always uses sizeof(*array) for the element-size, so that it can never go out of sync with the declared type of the array. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-22 14:51:09 -08:00
Jeff King	de1e67d070	list-objects: pass full pathname to callbacks When we find a blob at "a/b/c", we currently pass this to our show_object_fn callbacks as two components: "a/b/" and "c". Callbacks which want the full value then call path_name(), which concatenates the two. But this is an inefficient interface; the path is a strbuf, and we could simply append "c" to it temporarily, then roll back the length, without creating a new copy. So we could improve this by teaching the callsites of path_name() this trick (and there are only 3). But we can also notice that no callback actually cares about the broken-down representation, and simply pass each callback the full path "a/b/c" as a string. The callback code becomes even simpler, then, as we do not have to worry about freeing an allocated buffer, nor rolling back our modification to the strbuf. This is theoretically less efficient, as some callbacks would not bother to format the final path component. But in practice this is not measurable. Since we use the same strbuf over and over, our work to grow it is amortized, and we really only pay to memcpy a few bytes. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-12 12:51:17 -08:00
Jeff King	bd64516aca	list-objects: drop name_path entirely In the previous commit, we left name_path as a thin wrapper around a strbuf. This patch drops it entirely. As a result, every show_object_fn callback needs to be adjusted. However, none of their code needs to be changed at all, because the only use was to pass it to path_name(), which now handles the bare strbuf. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-12 12:51:15 -08:00
brian m. carlson	ed1c9977cb	Remove get_object_hash. Convert all instances of get_object_hash to use an appropriate reference to the hash member of the oid member of struct object. This provides no functional change, as it is essentially a macro substitution. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>	2015-11-20 08:02:05 -05:00
brian m. carlson	f2fd0760f6	Convert struct object to object_id struct object is one of the major data structures dealing with object IDs. Convert it to use struct object_id instead of an unsigned char array. Convert get_object_hash to refer to the new member as well. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>	2015-11-20 08:02:05 -05:00
brian m. carlson	7999b2cf77	Add several uses of get_object_hash. Convert most instances where the sha1 member of struct object is dereferenced to use get_object_hash. Most instances that are passed to functions that have versions taking struct object_id, such as get_sha1_hex/get_oid_hex, or instances that can be trivially converted to use struct object_id instead, are not converted. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>	2015-11-20 08:02:05 -05:00
Junio C Hamano	8746e30541	Merge branch 'ah/pack-objects-usage-strings' Usage string fix. * ah/pack-objects-usage-strings: pack-objects: place angle brackets around placeholders in usage strings	2015-09-01 16:31:12 -07:00
Alex Henrie	b8c1d27577	pack-objects: place angle brackets around placeholders in usage strings Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-08-28 11:59:10 -07:00
Charles Bailey	2a514ed805	parse-options: move unsigned long option parsing out of pack-objects.c The unsigned long option parsing (including 'k'/'m'/'g' suffix parsing) is more widely applicable. Add support for OPT_MAGNITUDE to parse-options.h and change pack-objects.c use this support. The error behavior on parse errors follows that of OPT_INTEGER. The name of the option that failed to parse is reported with a brief message describing the expect format for the option argument and then the full usage message for the command invoked. This differs from the previous behavior for OPT_ULONG used in pack-objects for --max-pack-size and --window-memory which used to display the value supplied in the error message and did not display the full usage message. Signed-off-by: Charles Bailey <cbailey32@bloomberg.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-06-22 15:07:21 -07:00
Michael Haggerty	d155254c73	builtin/pack-objects: rewrite to take an object_id argument Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-05-25 12:19:29 -07:00
Michael Haggerty	2b2a5be394	each_ref_fn: change to take an object_id parameter Change typedef each_ref_fn to take a "const struct object_id oid" parameter instead of "const unsigned char sha1". To aid this transition, implement an adapter that can be used to wrap old-style functions matching the old typedef, which is now called "each_ref_sha1_fn"), and make such functions callable via the new interface. This requires the old function and its cb_data to be wrapped in a "struct each_ref_fn_sha1_adapter", and that object to be used as the cb_data for an adapter function, each_ref_fn_adapter(). This is an enormous diff, but most of it consists of simple, mechanical changes to the sites that call any of the "for_each_ref" family of functions. Subsequent to this change, the call sites can be rewritten one by one to use the new interface. Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-05-25 12:19:27 -07:00
Junio C Hamano	cedeffeee0	Merge branch 'jk/sha1-file-reduce-useless-warnings' * jk/sha1-file-reduce-useless-warnings: sha1_file: squelch "packfile cannot be accessed" warnings	2015-05-11 14:23:41 -07:00
Jeff King	319b678a7b	sha1_file: squelch "packfile cannot be accessed" warnings When we find an object in a packfile index, we make sure we can still open the packfile itself (or that it is already open), as it might have been deleted by a simultaneous repack. If we can't access the packfile, we print a warning for the user and tell the caller that we don't have the object (we can then look in other packfiles, or find a loose version, before giving up). The warning we print to the user isn't really accomplishing anything, and it is potentially confusing to users. In the normal case, it is complete noise; we find the object elsewhere, and the user does not have to care that we racily saw a packfile index that became stale. It didn't affect the operation at all. A possibly more interesting case is when we later can't find the object, and report failure to the user. In this case the warning could be considered a clue toward that ultimate failure. But it's not really a useful clue in practice. We wouldn't even print it consistently (since we are racing with another process, we might not even see the .idx file, or we might win the race and open the packfile, completing the operation). This patch drops the warning entirely (not only from the fill_pack_entry site, but also from an identical use in pack-objects). If we did find the warning interesting in the error case, we could stuff it away and reveal it to the user when we later die() due to the broken object. But that complexity just isn't worth it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-03-30 21:47:39 -07:00
Junio C Hamano	6902c4da58	Merge branch 'rs/deflate-init-cleanup' Code simplification. * rs/deflate-init-cleanup: zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}	2015-03-17 16:01:26 -07:00
René Scharfe	9a6f1287fb	zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw} Clear the git_zstream variable at the start of git_deflate_init() etc. so that callers don't have to do that. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-03-05 15:46:03 -08:00
brian m. carlson	2dacf26d09	pack-objects: use --objects-edge-aggressive for shallow repos When fetching into or pushing from a shallow repository, we want to aggressively mark edges as uninteresting, since this decreases the pack size. However, aggressively marking edges can negatively affect performance on large non-shallow repositories with lots of refs. Teach pack-objects a --shallow option to indicate that we're pushing from or fetching into a shallow repository. Use --objects-edge-aggressive only for shallow repositories and otherwise use --objects-edge, which performs better in the general case. Update the callers to pass the --shallow option when they are dealing with a shallow repository. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-12-29 09:58:25 -08:00
brian m. carlson	1684c1b219	rev-list: add an option to mark fewer edges as uninteresting In commit `fbd4a70` (list-objects: mark more commits as edges in mark_edges_uninteresting - 2013-08-16), we marked an increasing number of edges uninteresting. This change, and the subsequent change to make this conditional on --objects-edge, are used by --thin to make much smaller packs for shallow clones. Unfortunately, they cause a significant performance regression when pushing non-shallow clones with lots of refs (23.322 seconds vs. 4.785 seconds with 22400 refs). Add an option to git rev-list, --objects-edge-aggressive, that preserves this more aggressive behavior, while leaving --objects-edge to provide more performant behavior. Preserve the current behavior for the moment by using the aggressive option. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-12-29 09:57:55 -08:00
Junio C Hamano	d70e331c0e	Merge branch 'jk/prune-mtime' Tighten the logic to decide that an unreachable cruft is sufficiently old by covering corner cases such as an ancient object becoming reachable and then going unreachable again, in which case its retention period should be prolonged. * jk/prune-mtime: (28 commits) drop add_object_array_with_mode revision: remove definition of unused 'add_object' function pack-objects: double-check options before discarding objects repack: pack objects mentioned by the index pack-objects: use argv_array reachable: use revision machinery's --indexed-objects code rev-list: add --indexed-objects option rev-list: document --reflog option t5516: test pushing a tag of an otherwise unreferenced blob traverse_commit_list: support pending blobs/trees with paths make add_object_array_with_context interface more sane write_sha1_file: freshen existing objects pack-objects: match prune logic for discarding objects pack-objects: refactor unpack-unreachable expiration check prune: keep objects reachable from recent objects sha1_file: add for_each iterators for loose and packed objects count-objects: use for_each_loose_file_in_objdir count-objects: do not use xsize_t when counting object size prune-packed: use for_each_loose_file_in_objdir reachable: mark index blobs as SEEN ...	2014-10-29 10:07:56 -07:00
Junio C Hamano	e4da4fbe0e	Merge branch 'eb/no-pthreads' Allow us build with NO_PTHREADS=NoThanks compilation option. * eb/no-pthreads: Handle atexit list internaly for unthreaded builds pack-objects: set number of threads before checking and warning index-pack: fix compilation with NO_PTHREADS	2014-10-24 14:59:10 -07:00
Junio C Hamano	26a22d8d00	Merge branch 'jk/pack-objects-no-bitmap-when-splitting' Splitting pack-objects output into multiple packs is incompatible with the use of reachability bitmap. * jk/pack-objects-no-bitmap-when-splitting: pack-objects: turn off bitmaps when we split packs	2014-10-24 14:56:10 -07:00
Jeff King	2113471478	pack-objects: turn off bitmaps when we split packs If a pack.packSizeLimit is set, we may split the pack data across multiple packfiles. This means we cannot generate .bitmap files, as they require that all of the reachable objects are in the same pack. We check that condition when we are generating the list of objects to pack (and disable bitmaps if we are not packing everything), but we forgot to update it when we notice that we needed to split (which doesn't happen until the actual write phase). The resulting bitmaps are quite bogus (they mention entries that do not exist in the pack!) and can cause a fetch or push to send insufficient objects. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-19 15:08:38 -07:00
Jeff King	b1e757f363	pack-objects: double-check options before discarding objects When we are given an expiration time like --unpack-unreachable=2.weeks.ago, we avoid writing out old, unreachable loose objects entirely, under the assumption that running "prune" would simply delete them immediately anyway. However, this is only valid if we computed the same set of reachable objects as prune would. In practice, this is the case, because only git-repack uses the --unpack-unreachable option with an expiration, and it always feeds as many objects into the pack as possible. But we can double-check at runtime just to be sure. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-19 15:07:07 -07:00
Jeff King	c90f9e13ab	repack: pack objects mentioned by the index When we pack all objects, we use only the objects reachable from references and reflogs. This misses any objects which are reachable from the index, but not yet referenced. By itself this isn't a big deal; the objects can remain loose until they are actually used in a commit. However, it does create a problem when we drop packed but unreachable objects. We try to optimize out the writing of objects that we will immediately prune, which means we must follow the same rules as prune in determining what is reachable. And prune uses the index for this purpose. This is rather uncommon in practice, as objects in the index would not usually have been packed in the first place. But it could happen in a sequence like: 1. You make a commit on a branch that references blob X. 2. You repack, moving X into the pack. 3. You delete the branch (and its reflog), so that X is unreferenced. 4. You "git add" blob X so that it is now referenced only by the index. 5. You repack again with git-gc. The pack-objects we invoke will see that X is neither referenced nor recent and not bother loosening it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-19 15:07:07 -07:00
Jeff King	edfbb2aa53	pack-objects: use argv_array This saves us from having to bump the rp_av count when we add new traversal options. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-19 15:07:07 -07:00
Jeff King	abcb86553d	pack-objects: match prune logic for discarding objects A recent commit taught git-prune to keep non-recent objects that are reachable from recent ones. However, pack-objects, when loosening unreachable objects, tries to optimize out the write in the case that the object will be immediately pruned. It now gets this wrong, since its rule does not reflect the new prune code (and this can be seen by running t6501 with a strategically placed repack). Let's teach pack-objects similar logic. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-16 10:10:43 -07:00
Jeff King	d0d46abc16	pack-objects: refactor unpack-unreachable expiration check When we are loosening unreachable packed objects, we do not bother to process objects that would simply be pruned immediately anyway. The "would be pruned" check is a simple comparison, but is about to get more complicated. Let's pull it out into a separate function. Note that this is slightly less efficient than the original, which avoided even opening old packs, since no object in them could pass the current check, which cares only about the pack mtime. But the new rules will depend on the exact object, so we need to perform the check even for old packs. Note also that we fix a minor buglet when the pack mtime is exactly the same as the expiration time. The prune code considers that worth pruning, whereas our check here considered it worth keeping. This wasn't a big deal. Besides being unlikely to happen, the result was simply that the object was loosened and then pruned, missing the optimization. Still, we can easily fix it while we are here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-16 10:10:42 -07:00
Junio C Hamano	0c45d258ec	pack-objects: set number of threads before checking and warning Under NO_PTHREADS build, we warn when delta_search_threads is not set to 1, because that is the only sensible value on a single threaded build. However, the auto detection that kicks in when that variable is set to 0 (e.g. there is no configuration variable or command line option, or an explicit --threads=0 is given from the command line to override the pack.threads configuration to force auto-detection) was not done before the condition to issue this warning was tested. Move the auto-detection code and place it at an appropriate spot. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-10-13 12:53:46 -07:00
René Scharfe	2756ca4347	use REALLOC_ARRAY for changing the allocation size of arrays Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-09-18 09:13:42 -07:00
Junio C Hamano	a3d54f9a1f	Merge branch 'jk/pack-shallow-always-without-bitmap' Reachability bitmaps do not work with shallow operations. Fixes regression in 2.0. * jk/pack-shallow-always-without-bitmap: pack-objects: turn off bitmaps when we see --shallow lines	2014-08-26 11:16:25 -07:00
Jeff King	f7f91086a3	pack-objects: turn off bitmaps when we see --shallow lines Reachability bitmaps do not work with shallow operations, because they cache a view of the object reachability that represents the true objects. Whereas a shallow repository (or a shallow operation in a repository) is inherently cutting off the object graph with a graft. We explicitly disallow the use of bitmaps in shallow repositories by checking is_repository_shallow(), and we should continue to do that. However, we also want to disallow bitmaps when we are serving a fetch to a shallow client, since we momentarily take on their grafted view of the world. It used to be enough to call is_repository_shallow at the start of pack-objects. Upload-pack wrote the other side's shallow state to a temporary file and pointed the whole pack-objects process at this state with "git --shallow-file", and from the perspective of pack-objects, we really were in a shallow repo. But since `b790e0f` (upload-pack: send shallow info over stdin to pack-objects, 2014-03-11), we do it differently: we send --shallow lines to pack-objects over stdin, and it registers them itself. This means that our is_repository_shallow check is way too early (we have not been told about the shallowness yet), and that it is insufficient (calling is_repository_shallow is not enough, as the shallow grafts we register do not change its return value). Instead, we can just turn off bitmaps explicitly when we see these lines. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-08-12 12:17:19 -07:00
Junio C Hamano	25f3119000	Merge branch 'jk/repack-pack-writebitmaps-config' * jk/repack-pack-writebitmaps-config: t7700: drop explicit --no-pack-kept-objects from .keep test repack: introduce repack.writeBitmaps config option repack: simplify handling of --write-bitmap-index pack-objects: stop respecting pack.writebitmaps	2014-06-25 12:23:19 -07:00
Jeff King	15a906c5e9	pack-objects: stop respecting pack.writebitmaps The handling of the pack.writebitmaps config option originally happened in pack-objects, which is quite low-level. It would make more sense for drivers of pack-objects to read the config, and then manipulate pack-objects with command-line options. Recently, repack learned to do so, making the low-level read of pack.writebitmaps redundant here. Other callers, like upload-pack, would not generally want to write bitmaps anyway. This could be considered a regression for somebody who is driving pack-objects themselves outside of repack and expects the config option to be used. However, such users seem rather unlikely given how new the bitmap code is (and the fact that they would basically be reimplementing repack in the first place). Note that we do not do anything with pack.writeBitmapHashCache here. That option is not about "do we write bimaps", but rather "when we are writing bitmaps, how do we do it?". You would want that to kick in anytime you decide to write them, similar to how pack.indexVersion is used. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-06-10 14:01:53 -07:00
Junio C Hamano	967f8c9184	Merge branch 'jk/pack-bitmap' * jk/pack-bitmap: pack-objects: do not reuse packfiles without --delta-base-offset add `ignore_missing_links` mode to revwalk	2014-04-08 12:00:33 -07:00
Junio C Hamano	d59c12d7ad	Merge branch 'jl/nor-or-nand-and' Eradicate mistaken use of "nor" (that is, essentially "nor" used not in "neither A nor B" ;-)) from in-code comments, command output strings, and documentations. * jl/nor-or-nand-and: code and test: fix misuses of "nor" comments: fix misuses of "nor" contrib: fix misuses of "nor" Documentation: fix misuses of "nor"	2014-04-08 12:00:28 -07:00
Jeff King	69e4b3426a	pack-objects: do not reuse packfiles without --delta-base-offset When we are sending a packfile to a remote, we currently try to reuse a whole chunk of packfile without bothering to look at the individual objects. This can make things like initial clones much lighter on the server, as we can just dump the packfile bytes. However, it's possible that the other side cannot read our packfile verbatim. For example, we may have objects stored as OFS_DELTA, but the client is an antique version of git that only understands REF_DELTA. We negotiate this capability over the fetch protocol. A normal pack-objects run will convert OFS_DELTA into REF_DELTA on the fly, but the "reuse pack" code path never even looks at the objects. This patch disables packfile reuse if the other side is missing any capabilities that we might have used in the on-disk pack. Right now the only one is OFS_DELTA, but we may need to expand in the future (e.g., if packv4 introduces new object types). We could be more thorough and only disable reuse in this case when we actually have an OFS_DELTA to send, but: 1. We almost always will have one, since we prefer OFS_DELTA to REF_DELTA when possible. So this case would almost never come up. 2. Looking through the objects defeats the purpose of the optimization, which is to do as little work as possible to get the bytes to the remote. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-04-04 15:29:44 -07:00
Justin Lebar	01689909eb	comments: fix misuses of "nor" Signed-off-by: Justin Lebar <jlebar@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-31 15:29:27 -07:00
Junio C Hamano	0ddcc9cfba	Merge branch 'jk/pack-bitmap-progress' The progress output while repacking and transferring objects showed an apparent large silence while writing the objects out of existing packfiles, when the reachability bitmap was in use. * jk/pack-bitmap-progress: pack-objects: show reused packfile objects in "Counting objects" pack-objects: show progress for reused packfiles	2014-03-28 13:50:56 -07:00
Junio C Hamano	e2450e1245	Merge branch 'jk/pack-bitmap' Instead of dying when asked to (re)pack with the reachability bitmap when a bitmap cannot be built, just (re)pack without producing a bitmap in such a case, with a warning. * jk/pack-bitmap: pack-objects: turn off bitmaps when skipping objects	2014-03-28 13:50:50 -07:00
Junio C Hamano	1ddb4d7e5e	Merge branch 'nd/upload-pack-shallow' Serving objects from a shallow repository needs to write a temporary file to be used, but the serving upload-pack may not have write access to the repository which is meant to be read-only. Instead feed these temporary shallow bounds from the standard input of pack-objects so that we do not have to use a temporary file. * nd/upload-pack-shallow: upload-pack: send shallow info over stdin to pack-objects	2014-03-21 12:49:08 -07:00
Junio C Hamano	f4eec8ce05	Merge branch 'sh/finish-tmp-packfile' * sh/finish-tmp-packfile: finish_tmp_packfile():use strbuf for pathname construction	2014-03-18 13:50:24 -07:00
Junio C Hamano	fe9122a352	Merge branch 'dd/use-alloc-grow' Replace open-coded reallocation with ALLOC_GROW() macro. * dd/use-alloc-grow: sha1_file.c: use ALLOC_GROW() in pretend_sha1_file() read-cache.c: use ALLOC_GROW() in add_index_entry() builtin/mktree.c: use ALLOC_GROW() in append_to_tree() attr.c: use ALLOC_GROW() in handle_attr_line() dir.c: use ALLOC_GROW() in create_simplify() reflog-walk.c: use ALLOC_GROW() replace_object.c: use ALLOC_GROW() in register_replace_object() patch-ids.c: use ALLOC_GROW() in add_commit() diffcore-rename.c: use ALLOC_GROW() diff.c: use ALLOC_GROW() commit.c: use ALLOC_GROW() in register_commit_graft() cache-tree.c: use ALLOC_GROW() in find_subtree() bundle.c: use ALLOC_GROW() in add_to_ref_list() builtin/pack-objects.c: use ALLOC_GROW() in check_pbase_path()	2014-03-18 13:50:21 -07:00
Jeff King	373c67da1d	pack-objects: turn off bitmaps when skipping objects The pack bitmap format requires that we have a single bit for each object in the pack, and that each object's bitmap represents its complete set of reachable objects. Therefore we have no way to represent the bitmap of an object which references objects outside the pack. We notice this problem while generating the bitmaps, as we try to find the offset of a particular object and realize that we do not have it. In this case we die, and neither the bitmap nor the pack is generated. This is correct, but perhaps a little unfriendly. If you have bitmaps turned on in the config, many repacks will fail which would otherwise succeed. E.g., incremental repacks, repacks with "-l" when you have alternates, ".keep" files. Instead, this patch notices early that we are omitting some objects from the pack and turns off bitmaps (with a warning). Note that this is not strictly correct, as it's possible that the object being omitted is not reachable from any other object in the pack. In practice, this is almost never the case, and there are two advantages to doing it this way: 1. The code is much simpler, as we do not have to cleanly abort the bitmap-generation process midway through. 2. We do not waste time partially generating bitmaps only to find out that some object deep in the history is not being packed. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-17 15:02:39 -07:00
Jeff King	78d2214eb4	pack-objects: show reused packfile objects in "Counting objects" When we are sending a pack for push or fetch, we may reuse a chunk of packfile without even parsing it. The progress meter then looks like this: Reusing existing pack: 3440489, done. Counting objects: 3, done. The first line shows that we are reusing a large chunk of objects, and then we further count any objects not included in the reused portion with an actual traversal. These are all implementation details that the user does not need to care about. Instead, we can show the reused objects in the normal "counting..." progress meter (which will simply go much faster than normal), and then continue to add to it as we traverse. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-17 15:01:27 -07:00
Jeff King	657673f125	pack-objects: show progress for reused packfiles When the "--all-progress" option is in effect, pack-objects shows a progress report for the "writing" phase. If the repository has bitmaps and we are reusing a packfile, the user sees no progress update until the whole packfile is sent. Since this is typically the bulk of what is being written, it can look like git hangs during this phase, even though the transfer is proceeding. This generally only happens with "git push" from a repository with bitmaps. We do not use "--all-progress" for fetch (since the result is going to index-pack on the client, which takes care of progress reporting). And for regular repacks to disk, we do not reuse packfiles. We already have the progress meter setup during write_reused_pack; we just need to call display_progress whiel we are writing out the pack. The progress meter is attached to our output descriptor, so it automatically handles the throughput measurements. However, we need to update the object count as we go, since that is what feeds the percentage we show. We aren't actually parsing the packfile as we send it, so we have no idea how many objects we have sent; we only know that at the end of N bytes, we will have sent M objects. So we cheat a little and assume each object is M/N bytes (i.e., the mean of the objects we are sending). While this isn't strictly true, it actually produces a more pleasing progress meter for the user, as it moves smoothly and predictably (and nobody really cares about the object count; they care about the percentage, and the object count is a proxy for that). One alternative would be to actually show two progress meters: one for the reused pack, and one for the rest of the objects. That would more closely reflect the data we have (the first would be measured in bytes, and the second measured in objects). But it would also be more complex and annoying to the user; rather than seeing one progress meter counting up to 100%, they would finish one meter, then start another one at zero. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-17 15:01:25 -07:00
Junio C Hamano	56e2874a81	Merge branch 'sh/write-pack-file-warning-message-fix' A warning from "git pack-objects" were generated by referring to an incorrect variable when forming the filename that we had trouble with. * sh/write-pack-file-warning-message-fix: write_pack_file: use correct variable in diagnostic	2014-03-14 14:27:17 -07:00
Junio C Hamano	3e30cb0fbf	Merge branch 'mh/replace-refs-variable-rename' * mh/replace-refs-variable-rename: Document some functions defined in object.c Add docstrings for lookup_replace_object() and do_lookup_replace_object() rename read_replace_refs to check_replace_refs	2014-03-14 14:27:06 -07:00
Junio C Hamano	0963008cbf	Merge branch 'nd/i18n-progress' Mark the progress indicators from various time-consuming commands for i18n/l10n. * nd/i18n-progress: i18n: mark all progress lines for translation	2014-03-14 14:26:31 -07:00
Nguyễn Thái Ngọc Duy	b790e0f67c	upload-pack: send shallow info over stdin to pack-objects Before `cdab485` (upload-pack: delegate rev walking in shallow fetch to pack-objects - 2013-08-16) upload-pack does not write to the source repository. `cdab485` starts to write $GIT_DIR/shallow_XXXXXX if it's a shallow fetch, so the source repo must be writable. git:// servers do not need write access to repos and usually don't have it, which means `cdab485` breaks shallow clone over git:// Instead of using a temporary file as the media for shallow points, we can send them over stdin to pack-objects as well. Prepend shallow SHA-1 with --shallow so pack-objects knows what is what. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-11 13:32:10 -07:00
Dmitry S. Dolzhenko	25e1940709	builtin/pack-objects.c: use ALLOC_GROW() in check_pbase_path() Signed-off-by: Dmitry S. Dolzhenko <dmitrys.dolzhenko@yandex.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-03 14:44:11 -08:00
Sun He	5889271114	finish_tmp_packfile():use strbuf for pathname construction The old version fixes a maximum length on the buffer, which could be a problem if one is not certain of the length of get_object_directory(). Using strbuf can avoid the protential bug. Helped-by: Michael Haggerty <mhagger@alum.mit.edu> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Sun He <sunheehnus@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-03 12:15:10 -08:00
Junio C Hamano	2156a98045	Merge branch 'sh/write-pack-file-warning-message-fix' into sh/finish-tmp-packfile * sh/write-pack-file-warning-message-fix: write_pack_file: use correct variable in diagnostic	2014-03-03 12:13:20 -08:00
Sun He	0eea5a6e91	write_pack_file: use correct variable in diagnostic 'pack_tmp_name' is the subject of the utime() check, so report it in the warning, not the uninitialized 'tmpname' Signed-off-by: Sun He <sunheehnus@gmail.com> Reviewed-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-03 10:43:40 -08:00
Junio C Hamano	0f9e62e084	Merge branch 'jk/pack-bitmap' Borrow the bitmap index into packfiles from JGit to speed up enumeration of objects involved in a commit range without having to fully traverse the history. * jk/pack-bitmap: (26 commits) ewah: unconditionally ntohll ewah data ewah: support platforms that require aligned reads read-cache: use get_be32 instead of hand-rolled ntoh_l block-sha1: factor out get_be and put_be wrappers do not discard revindex when re-preparing packfiles pack-bitmap: implement optional name_hash cache t/perf: add tests for pack bitmaps t: add basic bitmap functionality tests count-objects: recognize .bitmap in garbage-checking repack: consider bitmaps when performing repacks repack: handle optional files created by pack-objects repack: turn exts array into array-of-struct repack: stop using magic number for ARRAY_SIZE(exts) pack-objects: implement bitmap writing rev-list: add bitmap mode to speed up object lists pack-objects: use bitmaps when packing objects pack-objects: split add_object_entry pack-bitmap: add support for bitmap indexes documentation: add documentation for the bitmap format ewah: compressed bitmap implementation ...	2014-02-27 14:01:48 -08:00
Nguyễn Thái Ngọc Duy	754dbc43f0	i18n: mark all progress lines for translation Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-02-24 09:08:37 -08:00
Michael Haggerty	afc711b8e1	rename read_replace_refs to check_replace_refs The semantics of this flag was changed in commit `e1111cef23` inline lookup_replace_object() calls but wasn't renamed at the time to minimize code churn. Rename it now, and add a comment explaining its use. Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-02-20 14:16:55 -08:00
Vicent Marti	ae4f07fbcc	pack-bitmap: implement optional name_hash cache When we use pack bitmaps rather than walking the object graph, we end up with the list of objects to include in the packfile, but we do not know the path at which any tree or blob objects would be found. In a recently packed repository, this is fine. A fetch would use the paths only as a heuristic in the delta compression phase, and a fully packed repository should not need to do much delta compression. As time passes, though, we may acquire more objects on top of our large bitmapped pack. If clients fetch frequently, then they never even look at the bitmapped history, and all works as usual. However, a client who has not fetched since the last bitmap repack will have "have" tips in the bitmapped history, but "want" newer objects. The bitmaps themselves degrade gracefully in this circumstance. We manually walk the more recent bits of history, and then use bitmaps when we hit them. But we would also like to perform delta compression between the newer objects and the bitmapped objects (both to delta against what we know the user already has, but also between "new" and "old" objects that the user is fetching). The lack of pathnames makes our delta heuristics much less effective. This patch adds an optional cache of the 32-bit name_hash values to the end of the bitmap file. If present, a reader can use it to match bitmapped and non-bitmapped names during delta compression. Here are perf results for p5310: Test origin/master HEAD^ HEAD ------------------------------------------------------------------------------------------------- 5310.2: repack to disk 36.81(37.82+1.43) 47.70(48.74+1.41) +29.6% 47.75(48.70+1.51) +29.7% 5310.3: simulated clone 30.78(29.70+2.14) 1.08(0.97+0.10) -96.5% 1.07(0.94+0.12) -96.5% 5310.4: simulated fetch 3.16(6.10+0.08) 3.54(10.65+0.06) +12.0% 1.70(3.07+0.06) -46.2% 5310.6: partial bitmap 36.76(43.19+1.81) 6.71(11.25+0.76) -81.7% 4.08(6.26+0.46) -88.9% You can see that the time spent on an incremental fetch goes down, as our delta heuristics are able to do their work. And we save time on the partial bitmap clone for the same reason. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-12-30 12:19:23 -08:00
Vicent Marti	7cc8f97108	pack-objects: implement bitmap writing This commit extends more the functionality of `pack-objects` by allowing it to write out a `.bitmap` index next to any written packs, together with the `.idx` index that currently gets written. If bitmap writing is enabled for a given repository (either by calling `pack-objects` with the `--write-bitmap-index` flag or by having `pack.writebitmaps` set to `true` in the config) and pack-objects is writing a packfile that would normally be indexed (i.e. not piping to stdout), we will attempt to write the corresponding bitmap index for the packfile. Bitmap index writing happens after the packfile and its index has been successfully written to disk (`finish_tmp_packfile`). The process is performed in several steps: 1. `bitmap_writer_set_checksum`: this call stores the partial checksum for the packfile being written; the checksum will be written in the resulting bitmap index to verify its integrity 2. `bitmap_writer_build_type_index`: this call uses the array of `struct object_entry` that has just been sorted when writing out the actual packfile index to disk to generate 4 type-index bitmaps (one for each object type). These bitmaps have their nth bit set if the given object is of the bitmap's type. E.g. the nth bit of the Commits bitmap will be 1 if the nth object in the packfile index is a commit. This is a very cheap operation because the bitmap writing code has access to the metadata stored in the `struct object_entry` array, and hence the real type for each object in the packfile. 3. `bitmap_writer_reuse_bitmaps`: if there exists an existing bitmap index for one of the packfiles we're trying to repack, this call will efficiently rebuild the existing bitmaps so they can be reused on the new index. All the existing bitmaps will be stored in a `reuse` hash table, and the commit selection phase will prioritize these when selecting, as they can be written directly to the new index without having to perform a revision walk to fill the bitmap. This can greatly speed up the repack of a repository that already has bitmaps. 4. `bitmap_writer_select_commits`: if bitmap writing is enabled for a given `pack-objects` run, the sequence of commits generated during the Counting Objects phase will be stored in an array. We then use that array to build up the list of selected commits. Writing a bitmap in the index for each object in the repository would be cost-prohibitive, so we use a simple heuristic to pick the commits that will be indexed with bitmaps. The current heuristics are a simplified version of JGit's original implementation. We select a higher density of commits depending on their age: the 100 most recent commits are always selected, after that we pick 1 commit of each 100, and the gap increases as the commits grow older. On top of that, we make sure that every single branch that has not been merged (all the tips that would be required from a clone) gets their own bitmap, and when selecting commits between a gap, we tend to prioritize the commit with the most parents. Do note that there is no right/wrong way to perform commit selection; different selection algorithms will result in different commits being selected, but there's no such thing as "missing a commit". The bitmap walker algorithm implemented in `prepare_bitmap_walk` is able to adapt to missing bitmaps by performing manual walks that complete the bitmap: the ideal selection algorithm, however, would select the commits that are more likely to be used as roots for a walk in the future (e.g. the tips of each branch, and so on) to ensure a bitmap for them is always available. 5. `bitmap_writer_build`: this is the computationally expensive part of bitmap generation. Based on the list of commits that were selected in the previous step, we perform several incremental walks to generate the bitmap for each commit. The walks begin from the oldest commit, and are built up incrementally for each branch. E.g. consider this dag where A, B, C, D, E, F are the selected commits, and a, b, c, e are a chunk of simplified history that will not receive bitmaps. A---a---B--b--C--c--D \ E--e--F We start by building the bitmap for A, using A as the root for a revision walk and marking all the objects that are reachable until the walk is over. Once this bitmap is stored, we reuse the bitmap walker to perform the walk for B, assuming that once we reach A again, the walk will be terminated because A has already been SEEN on the previous walk. This process is repeated for C, and D, but when we try to generate the bitmaps for E, we can reuse neither the current walk nor the bitmap we have generated so far. What we do now is resetting both the walk and clearing the bitmap, and performing the walk from scratch using E as the origin. This new walk, however, does not need to be completed. Once we hit B, we can lookup the bitmap we have already stored for that commit and OR it with the existing bitmap we've composed so far, allowing us to limit the walk early. After all the bitmaps have been generated, another iteration through the list of commits is performed to find the best XOR offsets for compression before writing them to disk. Because of the incremental nature of these bitmaps, XORing one of them with its predecesor results in a minimal "bitmap delta" most of the time. We can write this delta to the on-disk bitmap index, and then re-compose the original bitmaps by XORing them again when loaded. This is a phase very similar to pack-object's `find_delta` (using bitmaps instead of objects, of course), except the heuristics have been greatly simplified: we only check the 10 bitmaps before any given one to find best compressing one. This gives good results in practice, because there is locality in the ordering of the objects (and therefore bitmaps) in the packfile. 6. `bitmap_writer_finish`: the last step in the process is serializing to disk all the bitmap data that has been generated in the two previous steps. The bitmap is written to a tmp file and then moved atomically to its final destination, using the same process as `pack-write.c:write_idx_file`. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-12-30 12:19:22 -08:00
Vicent Marti	6b8fda2db1	pack-objects: use bitmaps when packing objects In this patch, we use the bitmap API to perform the `Counting Objects` phase in pack-objects, rather than a traditional walk through the object graph. For a reasonably-packed large repo, the time to fetch and clone is often dominated by the full-object revision walk during the Counting Objects phase. Using bitmaps can reduce the CPU time required on the server (and therefore start sending the actual pack data with less delay). For bitmaps to be used, the following must be true: 1. We must be packing to stdout (as a normal `pack-objects` from `upload-pack` would do). 2. There must be a .bitmap index containing at least one of the "have" objects that the client is asking for. 3. Bitmaps must be enabled (they are enabled by default, but can be disabled by setting `pack.usebitmaps` to false, or by using `--no-use-bitmap-index` on the command-line). If any of these is not true, we fall back to doing a normal walk of the object graph. Here are some sample timings from a full pack of `torvalds/linux` (i.e. something very similar to what would be generated for a clone of the repository) that show the speedup produced by various methods: [existing graph traversal] $ time git pack-objects --all --stdout --no-use-bitmap-index \ </dev/null >/dev/null Counting objects: 3237103, done. Compressing objects: 100% (508752/508752), done. Total 3237103 (delta 2699584), reused 3237103 (delta 2699584) real 0m44.111s user 0m42.396s sys 0m3.544s [bitmaps only, without partial pack reuse; note that pack reuse is automatic, so timing this required a patch to disable it] $ time git pack-objects --all --stdout </dev/null >/dev/null Counting objects: 3237103, done. Compressing objects: 100% (508752/508752), done. Total 3237103 (delta 2699584), reused 3237103 (delta 2699584) real 0m5.413s user 0m5.604s sys 0m1.804s [bitmaps with pack reuse (what you get with this patch)] $ time git pack-objects --all --stdout </dev/null >/dev/null Reusing existing pack: 3237103, done. Total 3237103 (delta 0), reused 0 (delta 0) real 0m1.636s user 0m1.460s sys 0m0.172s Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-12-30 12:19:22 -08:00
Jeff King	ce2bc42456	pack-objects: split add_object_entry This function actually does three things: 1. Check whether we've already added the object to our packing list. 2. Check whether the object meets our criteria for adding. 3. Actually add the object to our packing list. It's a little hard to see these three phases, because they happen linearly in the rather long function. Instead, this patch breaks them up into three separate helper functions. The result is a little easier to follow, though it unfortunately suffers from some optimization interdependencies between the stages (e.g., during step 3 we use the packing list index from step 1 and the packfile information from step 2). More importantly, though, the various parts can be composed differently, as they will be in the next patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-12-30 12:19:22 -08:00
Jeff King	9af270e8c2	do not pretend sha1write returns errors The sha1write function returns an int, but it will always be "0". The failure-prone parts of the function happen in the "flush" callback, which cannot pass an error back to us. So we just end up calling die() during the flush. Let's just drop the return value altogether, as it only confuses callers into thinking that it might be useful. Only one call site actually checked the return value. We can drop that check, since it just led to a die() anyway. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-12-26 11:50:20 -08:00
Christian Couder	5955654823	replace {pre,suf}fixcmp() with {starts,ends}_with() Leaving only the function definitions and declarations so that any new topic in flight can still make use of the old functions, replace existing uses of the prefixcmp() and suffixcmp() with new API functions. The change can be recreated by mechanically applying this: $ git grep -l -e prefixcmp -e suffixcmp -- \*.c \| grep -v strbuf\\.c \| xargs perl -pi -e ' s\|!prefixcmp\(\|starts_with\(\|g; s\|prefixcmp\(\|!starts_with\(\|g; s\|!suffixcmp\(\|ends_with\(\|g; s\|suffixcmp\(\|!ends_with\(\|g; ' on the result of preparatory changes in this series. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-12-05 14:13:21 -08:00
Vicent Marti	68fb36eb92	pack-objects: factor out name_hash As the pack-objects system grows beyond the single pack-objects.c file, more parts (like the soon-to-exist bitmap code) will need to compute hashes for matching deltas. Factor out name_hash to make it available to other files. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-10-24 15:44:52 -07:00
Vicent Marti	2834bc27c1	pack-objects: refactor the packing list The hash table that stores the packing list for a given `pack-objects` run was tightly coupled to the pack-objects code. In this commit, we refactor the hash table and the underlying storage array into a `packing_data` struct. The functionality for accessing and adding entries to the packing list is hence accessible from other parts of Git besides the `pack-objects` builtin. This refactoring is a requirement for further patches in this series that will require accessing the commit packing list from outside of `pack-objects`. The hash table implementation has been minimally altered: we now use table sizes which are always a power of two, to ensure a uniform index distribution in the array. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-10-24 15:44:48 -07:00
Junio C Hamano	eeb8e8373f	Merge branch 'jc/pack-objects' * jc/pack-objects: pack-objects: shrink struct object_entry	2013-10-23 13:21:26 -07:00
Junio C Hamano	238504b014	Merge branch 'nd/fetch-into-shallow' When there is no sufficient overlap between old and new history during a fetch into a shallow repository, we unnecessarily sent objects the sending side knows the receiving end has. * nd/fetch-into-shallow: Add testcase for needless objects during a shallow fetch list-objects: mark more commits as edges in mark_edges_uninteresting list-objects: reduce one argument in mark_edges_uninteresting upload-pack: delegate rev walking in shallow fetch to pack-objects shallow: add setup_temporary_shallow() shallow: only add shallow graft points to new shallow file move setup_alternate_shallow and write_shallow_commits to shallow.c	2013-09-20 12:25:32 -07:00
Nguyễn Thái Ngọc Duy	e76a5fb459	list-objects: reduce one argument in mark_edges_uninteresting mark_edges_uninteresting() is always called with this form mark_edges_uninteresting(revs->commits, revs, ...); Remove the first argument and let mark_edges_uninteresting figure that out by itself. It helps answer the question "are this commit list and revs related in any way?" when looking at mark_edges_uninteresting implementation. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-08-28 11:54:18 -07:00
Brandon Casey	7c3ecb3254	Don't close pack fd when free'ing pack windows Now that close_one_pack() has been introduced to handle file descriptor pressure, it is not strictly necessary to close the pack file descriptor in unuse_one_window() when we're under memory pressure. Jeff King provided a justification for leaving the pack file open: If you close packfile descriptors, you can run into racy situations where somebody else is repacking and deleting packs, and they go away while you are trying to access them. If you keep a descriptor open, you're fine; they last to the end of the process. If you don't, then they disappear from under you. For normal object access, this isn't that big a deal; we just rescan the packs and retry. But if you are packing yourself (e.g., because you are a pack-objects started by upload-pack for a clone or fetch), it's much harder to recover (and we print some warnings). Let's do so (or uh, not do so). Signed-off-by: Brandon Casey <drafnel@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-08-02 09:27:26 -07:00
Junio C Hamano	63cdcfa40f	pack-objects: shrink struct object_entry Turn some boolean fields into bitfields and use uint32_t for name hash. This shrinks the size of the structure from 128 bytes to 120 bytes. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-02-04 15:23:35 -08:00
Jeff King	315ea32f1b	Merge branch 'jk/peel-ref' Speeds up "git upload-pack" (what is invoked by "git fetch" on the other side of the connection) by reducing the cost to advertise the branches and tags that are available in the repository. * jk/peel-ref: upload-pack: use peel_ref for ref advertisements peel_ref: check object type before loading peel_ref: do not return a null sha1 peel_ref: use faster deref_tag_noverify	2012-10-25 06:42:27 -04:00
Jeff King	e6dbffa67b	peel_ref: do not return a null sha1 The idea of the peel_ref function is to dereference tag objects recursively until we hit a non-tag, and return the sha1. Conceptually, it should return 0 if it is successful (and fill in the sha1), or -1 if there was nothing to peel. However, the current behavior is much more confusing. For a regular loose ref, the behavior is as described above. But there is an optimization to reuse the peeled-ref value for a ref that came from a packed-refs file. If we have such a ref, we return its peeled value, even if that peeled value is null (indicating that we know the ref definitely does _not_ peel). It might seem like such information is useful to the caller, who would then know not to bother loading and trying to peel the object. Except that they should not bother loading and trying to peel the object _anyway_, because that fallback is already handled by peel_ref. In other words, the whole point of calling this function is that it handles those details internally, and you either get a sha1, or you know that it is not peel-able. This patch catches the null sha1 case internally and converts it into a -1 return value (i.e., there is nothing to peel). This simplifies callers, which do not need to bother checking themselves. Two callers are worth noting: - in pack-objects, a comment indicates that there is a difference between non-peelable tags and unannotated tags. But that is not the case (before or after this patch). Whether you get a null sha1 has to do with internal details of how peel_ref operated. - in show-ref, if peel_ref returns a failure, the caller tries to decide whether to try peeling manually based on whether the REF_ISPACKED flag is set. But this doesn't make any sense. If the flag is set, that does not necessarily mean the ref came from a packed-refs file with the "peeled" extension. But it doesn't matter, because even if it didn't, there's no point in trying to peel it ourselves, as peel_ref would already have done so. In other words, the fallback peeling is guaranteed to fail. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-10-04 20:34:28 -07:00
Nguyễn Thái Ngọc Duy	4c6881204b	i18n: pack-objects: mark parseopt strings for translation Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-08-20 12:23:18 -07:00
Junio C Hamano	0958a24d73	Merge branch 'jc/sha1-name-more' Teaches the object name parser things like a "git describe" output is always a commit object, "A" in "git log A" must be a committish, and "A" and "B" in "git log A...B" both must be committish, etc., to prolong the lifetime of abbreviated object names. * jc/sha1-name-more: (27 commits) t1512: match the "other" object names t1512: ignore whitespaces in wc -l output rev-parse --disambiguate=<prefix> rev-parse: A and B in "rev-parse A..B" refer to committish reset: the command takes committish commit-tree: the command wants a tree and commits apply: --build-fake-ancestor expects blobs sha1_name.c: add support for disambiguating other types revision.c: the "log" family, except for "show", takes committish revision.c: allow handle_revision_arg() to take other flags sha1_name.c: introduce get_sha1_committish() sha1_name.c: teach lookup context to get_sha1_with_context() sha1_name.c: many short names can only be committish sha1_name.c: get_sha1_1() takes lookup flags sha1_name.c: get_describe_name() by definition groks only commits sha1_name.c: teach get_short_sha1() a commit-only option sha1_name.c: allow get_short_sha1() to take other flags get_sha1(): fix error status regression sha1_name.c: restructure disambiguation of short names sha1_name.c: correct misnamed "canonical" and "res" ...	2012-07-22 12:55:07 -07:00
Junio C Hamano	8e676e8ba5	revision.c: allow handle_revision_arg() to take other flags The existing "cant_be_filename" that tells the function that the caller knows the arg is not a path (hence it does not have to be checked for absense of the file whose name matches it) is made into a bit in the flag word. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-07-09 16:42:22 -07:00
Nguyễn Thái Ngọc Duy	cf2ba13ac6	pack-objects: use streaming interface for reading large loose blobs git usually streams large blobs directly to packs. But there are cases where git can create large loose blobs (unpack-objects or hash-object over pipe). Or they can come from other git implementations. core.bigfilethreshold can also be lowered down and introduce a new wave of large loose blobs. Use streaming interface to read/compress/write these blobs in one go. Fall back to normal way if somehow streaming interface cannot be used. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-05-29 10:50:56 -07:00
Nguyễn Thái Ngọc Duy	c9018b0305	pack-objects: refactor write_object() into helper functions The function first decides if we want to copy data taken from existing pack verbatim or we want to encode the data ourselves for the packfile we are creating and then carries out the decision. Separate the latter phase into two helper functions, one for the case the data is reused, the other for the case the data is produced anew. A little twist is that it can later turn out that we cannot reuse the data after we initially decide to do so; in such a case, the "reuse" helper makes a call to "generate" helper. It is easier to follow than the current fallback code that uses "goto" inside a single large function. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-05-18 14:22:15 -07:00
Nguyễn Thái Ngọc Duy	754980d023	pack-objects, streaming: turn "xx >= big_file_threshold" to ".. > .." This is because all other places do "xx > big_file_threshold" Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-05-18 14:21:19 -07:00
Jeff King	7e52f5660e	gc: do not explode objects which will be immediately pruned When we pack everything into one big pack with "git repack -Ad", any unreferenced objects in to-be-deleted packs are exploded into loose objects, with the intent that they will be examined and possibly cleaned up by the next run of "git prune". Since the exploded objects will receive the mtime of the pack from which they come, if the source pack is old, those loose objects will end up pruned immediately. In that case, it is much more efficient to skip the exploding step entirely for these objects. This patch teaches pack-objects to receive the expiration information and avoid writing these objects out. It also teaches "git gc" to pass the value of gc.pruneexpire to repack (which in turn learns to pass it along to pack-objects) so that this optimization happens automatically during "git gc" and "git gc --auto". Signed-off-by: Jeff King <peff@peff.net> Acked-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-04-11 11:09:49 -07:00
Michał Kiedrowicz	2b34e486bc	pack-objects: Fix compilation with NO_PTHREDS It looks like commit `99fb6e04` (pack-objects: convert to use parse_options(), 2012-02-01) moved the #ifdef NO_PTHREDS around but hasn't noticed that the 'arg' variable no longer is available. Signed-off-by: Michał Kiedrowicz <michal.kiedrowicz@gmail.com> Acked-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-02-26 17:46:00 -08:00
Nguyễn Thái Ngọc Duy	99fb6e04cb	pack-objects: convert to use parse_options() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-02-01 13:05:00 -08:00

1 2 3 4

200 Commits