Commit Graph

16 Commits

Author SHA1 Message Date
Nicolas Pitre
ee7dc310af block-sha1: more good unaligned memory access candidates
In addition to X86, PowerPC and S390 are capable of unaligned memory
accesses.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-13 10:41:02 -07:00
Nicolas Pitre
660231aa97 block-sha1: support for architectures with memory alignment restrictions
This is needed on architectures with poor or non-existent unaligned memory
support and/or no fast byte swap instruction (such as ARM) by using byte
accesses to memory and shifting the result together.

This also makes the code portable, therefore the byte access methods are
the defaults.  Any architecture that properly supports unaligned word
accesses in hardware simply has to enable the alternative methods.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-12 13:36:32 -07:00
Nicolas Pitre
dc52fd2973 block-sha1: split the different "hacks" to be individually selected
This is to make it easier for them to be selected individually depending
on the architecture instead of the other way around i.e. having each
architecture select a list of hacks up front.  That makes for clearer
documentation as well.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-12 13:35:54 -07:00
Nicolas Pitre
30ba0de726 block-sha1: move code around
Move the code around so specific architecture hacks are defined first.
Also make one line comments actually one line.  No code change.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-12 13:32:54 -07:00
Linus Torvalds
926172c5e4 block-sha1: improve code on large-register-set machines
For x86 performance (especially in 32-bit mode) I added that hack to write
the SHA1 internal temporary hash using a volatile pointer, in order to get
gcc to not try to cache the array contents. Because gcc will do all the
wrong things, and then spill things in insane random ways.

But on architectures like PPC, where you have 32 registers, it's actually
perfectly reasonable to put the whole temporary array[] into the register
set, and gcc can do so.

So make the 'volatile unsigned int *' cast be dependent on a
SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just
x86 and x86-64.  With that, the routine is fairly reasonable even when
compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on
a G5:

 * Paulus asm version:       about 3.67s
 * Yours with no change:     about 5.74s
 * Yours without "volatile": about 3.78s

so with this the C version is within about 3% of the asm one.

And add a lot of commentary on what the heck is going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-10 17:26:51 -07:00
Linus Torvalds
66c9c6c0fb block-sha1: improved SHA1 hashing
I think I have found a way to avoid the gcc crazyness.

Lookie here:

	#             TIME[s] SPEED[MB/s]
	rfc3174         5.094       119.8
	rfc3174         5.098       119.7
	linus           1.462       417.5
	linusas         2.008         304
	linusas2        1.878         325
	mozilla         5.566       109.6
	mozillaas       5.866       104.1
	openssl         1.609       379.3
	spelvin         1.675       364.5
	spelvina        1.601       381.3
	nettle          1.591       383.6

notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.

Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot.

The diff is trivial (on top of the "rename registers with cpp" patch), as
appended. And it does seem to fix the P4 issues too, although I can
obviously (once again) only test Prescott, and only in 64-bit mode:

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.662       36.73
	rfc3174          1.64       37.22
	linus          0.2523       241.9
	linusas        0.4367       139.8
	linusas2       0.4487         136
	mozilla        0.9704        62.9
	mozillaas      0.9399       64.94

that's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-07 22:32:46 -07:00
Linus Torvalds
30d12d4c16 block-sha1: perform register rotation using cpp
Instead of letting the compiler to figure out the optimal way to rotate
register usage, explicitly rotate the register names with cpp.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-07 22:19:45 -07:00
Linus Torvalds
5d5210c35a block-sha1: get rid of redundant 'lenW' context
.. and simplify the ctx->size logic.

We now count the size in bytes, which means that 'lenW' was always just
the low 6 bits of the total size, so we don't carry it around separately
any more.  And we do the 'size in bits' shift at the end.

Suggested by Nicolas Pitre and linux@horizon.com.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds
e869e113c8 block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
It's an equivalent expression, but the '+' gives us some freedom in
instruction selection (for example, we can use 'lea' rather than 'add'),
and associates with the other additions around it to give some minor
scheduling freedom.

Suggested-by: linux@horizon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds
ab14c823df block-sha1: macroize the rounds a bit further
Avoid repeating the shared parts of the different rounds by adding a
macro layer or two. It was already more cpp than C.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds
7b5075fcfb block-sha1: re-use the temporary array as we calculate the SHA1
The mozilla-SHA1 code did this 80-word array for the 80 iterations.  But
the SHA1 state is really just 512 bits, and you can actually keep it in
a kind of "circular queue" of just 16 words instead.

This requires us to do the xor updates as we go along (rather than as a
pre-phase), but that's really what we want to do anyway.

This gets me really close to the OpenSSL performance on my Nehalem.
Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't
strictly even matter on my Nehalem, it's just a local optimization).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds
139e3456ec block-sha1: make the 'ntohl()' part of the first SHA1 loop
This helps a teeny bit.  But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Junio C Hamano
fd536d3439 block-sha1: minor fixups
Bert Wesarg noticed non-x86 version of SHA_ROT() had a typo.
Also spell in-line assembly as __asm__(), otherwise I seem to get
error: implicit declaration of function 'asm' from my compiler.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds
b8e48a89b8 block-sha1: try to use rol/ror appropriately
Use the one with the smaller constant.  It _can_ generate slightly
smaller code (a constant of 1 is special), but perhaps more importantly
it's possibly faster on any uarch that does a rotate with a loop.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Junio C Hamano
b26a9d5089 block-sha1: undo ctx->size change
Undo the change I picked up from the mailing list discussion suggested
by Nico, not because it is wrong, but it will be done at the end of the
follow-up series.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:19 -07:00
Linus Torvalds
d7c208a92e Add new optimized C 'block-sha1' routines
Based on the mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.

It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-05 19:28:21 -07:00