Commit Graph

369 Commits

Author SHA1 Message Date
Mathias Krause 7da4b29d49 crypto: aesni - disable "by8" AVX CTR optimization
The "by8" implementation introduced in commit 22cddcc7df ("crypto: aes
- AES CTR x86_64 "by8" AVX optimization") is failing crypto tests as it
handles counter block overflows differently. It only accounts the right
most 32 bit as a counter -- not the whole block as all other
implementations do. This makes it fail the cryptomgr test #4 that
specifically tests this corner case.

As we're quite late in the release cycle, just disable the "by8" variant
for now.

Reported-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-09-24 21:15:31 +08:00
Fengguang Wu 4c1948fc47 crypto: sha-mb - sha1_mb_alg_state can be static
CC: Tim Chen <tim.c.chen@linux.intel.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-08-26 14:40:52 +08:00
Tim Chen ad61e042e9 crypto: sha-mb - SHA1 multibuffer job manager and glue code
This patch introduces the multi-buffer job manager which is responsible
for submitting scatter-gather buffers from several SHA1 jobs to the
multi-buffer algorithm.  It also contains the flush routine to that's
called by the crypto daemon to complete the job when no new jobs arrive
before the deadline of maximum latency of a SHA1 crypto job.

The SHA1 multi-buffer crypto algorithm is defined and initialized in
this patch.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-08-25 20:32:30 +08:00
Tim Chen 12d2513d5f crypto: sha-mb - SHA1 multibuffer crypto computation (x8 AVX2)
This patch introduces the assembly routines to do SHA1 computation on
buffers belonging to serveral jobs at once.  The assembly routines are
optimized with AVX2 instructions that have 8 data lanes and using AVX2
registers.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-08-25 20:32:29 +08:00
Tim Chen 2249cbb53e crypto: sha-mb - SHA1 multibuffer submit and flush routines for AVX2
This patch introduces the routines used to submit and flush buffers
belonging to SHA1 crypto jobs to the SHA1 multibuffer algorithm.  It is
implemented mostly in assembly optimized with AVX2 instructions.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-08-25 20:32:28 +08:00
Tim Chen 1161777823 crypto: sha-mb - SHA1 multibuffer algorithm data structures
This patch introduces the data structures and prototypes of functions
needed for computing SHA1 hash using multi-buffer.  Included are the
structures of the multi-buffer SHA1 job, job scheduler in C and x86
assembly.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-08-25 20:32:26 +08:00
Linus Torvalds 3e7a716a92 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 - CTR(AES) optimisation on x86_64 using "by8" AVX.
 - arm64 support to ccp
 - Intel QAT crypto driver
 - Qualcomm crypto engine driver
 - x86-64 assembly optimisation for 3DES
 - CTR(3DES) speed test
 - move FIPS panic from module.c so that it only triggers on crypto
   modules
 - SP800-90A Deterministic Random Bit Generator (drbg).
 - more test vectors for ghash.
 - tweak self tests to catch partial block bugs.
 - misc fixes.

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (94 commits)
  crypto: drbg - fix failure of generating multiple of 2**16 bytes
  crypto: ccp - Do not sign extend input data to CCP
  crypto: testmgr - add missing spaces to drbg error strings
  crypto: atmel-tdes - Switch to managed version of kzalloc
  crypto: atmel-sha - Switch to managed version of kzalloc
  crypto: testmgr - use chunks smaller than algo block size in chunk tests
  crypto: qat - Fixed SKU1 dev issue
  crypto: qat - Use hweight for bit counting
  crypto: qat - Updated print outputs
  crypto: qat - change ae_num to ae_id
  crypto: qat - change slice->regions to slice->region
  crypto: qat - use min_t macro
  crypto: qat - remove unnecessary parentheses
  crypto: qat - remove unneeded header
  crypto: qat - checkpatch blank lines
  crypto: qat - remove unnecessary return codes
  crypto: Resolve shadow warnings
  crypto: ccp - Remove "select OF" from Kconfig
  crypto: caam - fix DECO RSR polling
  crypto: qce - Let 'DEV_QCE' depend on both HAS_DMA and HAS_IOMEM
  ...
2014-08-04 09:52:51 -07:00
Jussi Kivilinna cfe82d4f45 crypto: sha512_ssse3 - fix byte count to bit count conversion
Byte-to-bit-count computation is only partly converted to big-endian and is
mixing in CPU-endian values. Problem was noticed by sparce with warning:

  CHECK   arch/x86/crypto/sha512_ssse3_glue.c
arch/x86/crypto/sha512_ssse3_glue.c:144:19: warning: restricted __be64 degrades to integer
arch/x86/crypto/sha512_ssse3_glue.c:144:17: warning: incorrect type in assignment (different base types)
arch/x86/crypto/sha512_ssse3_glue.c:144:17:    expected restricted __be64 <noident>
arch/x86/crypto/sha512_ssse3_glue.c:144:17:    got unsigned long long

Cc: <stable@vger.kernel.org>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-25 21:55:02 +08:00
Jussi Kivilinna 5e50d43d65 crypto: des3_ede-x86_64 - fix parse warning
Patch fixes following sparse warning:

  CHECK   arch/x86/crypto/des3_ede_glue.c
arch/x86/crypto/des3_ede_glue.c:308:52: warning: restricted __be64 degrades to integer
arch/x86/crypto/des3_ede_glue.c:309:52: warning: restricted __be64 degrades to integer
arch/x86/crypto/des3_ede_glue.c:310:52: warning: restricted __be64 degrades to integer
arch/x86/crypto/des3_ede_glue.c:326:44: warning: restricted __be64 degrades to integer

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-25 21:38:43 +08:00
chandramouli narayanan 22cddcc7df crypto: aes - AES CTR x86_64 "by8" AVX optimization
This patch introduces "by8" AES CTR mode AVX optimization inspired by
Intel Optimized IPSEC Cryptograhpic library. For additional information,
please see:
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972

The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
aes_ctr_enc_256_avx_by8() are adapted from
Intel Optimized IPSEC Cryptographic library. When both AES and AVX features
are enabled in a platform, the glue code in AESNI module overrieds the
existing "by4" CTR mode en/decryption with the "by8"
AES CTR mode en/decryption.

On a Haswell desktop, with turbo disabled and all cpus running
at maximum frequency, the "by8" CTR mode optimization
shows better performance results across data & key sizes
as measured by tcrypt.

The average performance improvement of the "by8" version over the "by4"
version is as follows:

For 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
For 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
For 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.

A typical run of tcrypt with AES CTR mode encryption of the "by4" and "by8"
optimization shows the following results:

tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
---------------------------------------------------------------------------

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)

tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
---------------------------------------------------------------------------

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 352 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 539 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1153 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 8458 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 353 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 512 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1277 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 8745 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 348 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 335 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 451 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1030 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6611 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 354 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 346 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 488 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1154 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 8390 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 357 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 362 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 515 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1284 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 8681 cycles (8192 bytes)

crypto: Incorporate feed back to AES CTR mode optimization patch

Specifically, the following:
a) alignment around main loop in aes_ctrby8_avx_x86_64.S
b) .rodata around data constants used in the assembely code.
c) the use of CONFIG_AVX in the glue code.
d) fix up white space.
e) informational message for "by8" AES CTR mode optimization
f) "by8" AES CTR mode optimization can be simply enabled
if the platform supports both AES and AVX features. The
optimization works superbly on Sandybridge as well.

Testing on Haswell shows no performance change since the last.

Testing on Sandybridge shows that the "by8" AES CTR mode optimization
greatly improves performance.

tcrypt log with "by4" AES CTR mode optimization on Sandybridge
--------------------------------------------------------------

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 383 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 408 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 707 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1864 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 12813 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 395 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 432 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 780 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 2132 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 15765 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 416 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 438 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 842 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 2383 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 16945 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 389 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 409 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 704 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1865 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 12783 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 409 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 434 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 792 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 2151 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 15804 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 421 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 444 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 840 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 2394 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 16928 cycles (8192 bytes)

tcrypt log with "by8" AES CTR mode optimization on Sandybridge
--------------------------------------------------------------

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 383 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 401 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 522 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1136 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7046 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 394 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 418 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 559 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1263 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9072 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 408 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 428 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1385 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 9224 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 390 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 402 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 530 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1135 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7079 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 414 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 417 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 572 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1312 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9073 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 415 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 454 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 598 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1407 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 9288 cycles (8192 bytes)

crypto: Fix redundant checks

a) Fix the redundant check for cpu_has_aes
b) Fix the key length check when invoking the CTR mode "by8"
encryptor/decryptor.

crypto: fix typo in AES ctr mode transform

Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
Reviewed-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-20 21:27:58 +08:00
Jussi Kivilinna 6574e6c64e crypto: des_3des - add x86-64 assembly implementation
Patch adds x86_64 assembly implementation of Triple DES EDE cipher algorithm.
Two assembly implementations are provided. First is regular 'one-block at
time' encrypt/decrypt function. Second is 'three-blocks at time' function that
gains performance increase on out-of-order CPUs.

tcrypt test results:

Intel Core i5-4570:

des3_ede-asm vs des3_ede-generic:
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16B     1.21x   1.22x   1.27x   1.36x   1.25x   1.25x
64B     1.98x   1.96x   1.23x   2.04x   2.01x   2.00x
256B    2.34x   2.37x   1.21x   2.40x   2.38x   2.39x
1024B   2.50x   2.47x   1.22x   2.51x   2.52x   2.51x
8192B   2.51x   2.53x   1.21x   2.56x   2.54x   2.55x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-20 21:27:58 +08:00
George Spelvin 473946e674 crypto: crc32c-pclmul - Shrink K_table to 32-bit words
There's no need for the K_table to be made of 64-bit words.  For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit values in there.  They can
all be reduced to 32 bits.

Doing that cuts the table size in half.  Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.

This adds (measured on Ivy Bridge) 1 cycle per main loop iteration
(CRC of up to 3K bytes), less than 0.2%.  The hope is that the reduced
D-cache footprint will make up the loss in other code.

Two other related fixes:
* K_table is read-only, so belongs in .rodata, and
* There's no need for more than 8-byte alignment

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: George Spelvin <linux@horizon.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-20 21:27:57 +08:00
Herbert Xu 0ea481466d crypto: ghash-clmulni-intel - Use u128 instead of be128 for internal key
The internal key isn't actually in big-endian format so let's switch
to u128 which also happens to allow us to remove a sparse warning.

Based on suggestion by Ard Biesheuvel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
2014-04-04 21:06:14 +08:00
Ard Biesheuvel 8ceee72808 crypto: ghash-clmulni-intel - use C implementation for setkey()
The GHASH setkey() function uses SSE registers but fails to call
kernel_fpu_begin()/kernel_fpu_end(). Instead of adding these calls, and
then having to deal with the restriction that they cannot be called from
interrupt context, move the setkey() implementation to the C domain.

Note that setkey() does not use any particular SSE features and is not
expected to become a performance bottleneck.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Fixes: 0e1227d356 (crypto: ghash - Add PCLMULQDQ accelerated implementation)
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-04-01 17:22:47 +08:00
Mathias Krause 37b2894717 crypto: x86/sha1 - reduce size of the AVX2 asm implementation
There is really no need to page align sha1_transform_avx2. The default
alignment is just fine. This is not the hot code but only the entry
point, after all.

Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-25 20:25:43 +08:00
Mathias Krause 6c8c17cc7a crypto: x86/sha1 - fix stack alignment of AVX2 variant
The AVX2 implementation might waste up to a page of stack memory because
of a wrong alignment calculation. This will, in the worst case, increase
the stack usage of sha1_transform_avx2() alone to 5.4 kB -- way to big
for a kernel function. Even worse, it might also allocate *less* bytes
than needed if the stack pointer is already aligned bacause in that case
the 'sub %rbx, %rsp' is effectively moving the stack pointer upwards,
not downwards.

Fix those issues by changing and simplifying the alignment calculation
to use a 32 byte alignment, the alignment really needed.

Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-25 20:25:43 +08:00
Mathias Krause 6ca5afb8c2 crypto: x86/sha1 - re-enable the AVX variant
Commit 7c1da8d0d0 "crypto: sha - SHA1 transform x86_64 AVX2"
accidentally disabled the AVX variant by making the avx_usable() test
not only fail in case the CPU doesn't support AVX or OSXSAVE but also
if it doesn't support AVX2.

Fix that regression by splitting up the AVX/AVX2 test into two
functions. Also test for the BMI1 extension in the avx2_usable() test
as the AVX2 implementation not only makes use of BMI2 but also BMI1
instructions.

Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-25 20:25:42 +08:00
chandramouli narayanan 7c1da8d0d0 crypto: sha - SHA1 transform x86_64 AVX2
This git patch adds x86_64 AVX2 optimization of SHA1
transform to crypto support. The patch has been tested with 3.14.0-rc1
kernel.

On a Haswell desktop, with turbo disabled and all cpus running
at maximum frequency, tcrypt shows AVX2 performance improvement
from 3% for 256 bytes update to 16% for 1024 bytes update over
AVX implementation.

This patch adds sha1_avx2_transform(), the glue, build and
configuration changes needed for AVX2 optimization of
SHA1 transform to crypto support.

sha1-ssse3 is one module which adds the necessary optimization
support (SSSE3/AVX/AVX2) for the low-level SHA1 transform function.
With better optimization support, transform function is overridden
as the case may be. In the case of AVX2, due to performance reasons
across datablock sizes, the AVX or AVX2 transform function is used
at run-time as it suits best. The Makefile change therefore appends
the necessary objects to the linkage. Due to this, the patch merely
appends AVX2 transform to the existing build mix and Kconfig support
and leaves the configuration build support as is.

Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-21 21:54:30 +08:00
Dan Carpenter b3bd5869fd crypto: remove a duplicate checks in __cbc_decrypt()
We checked "nbytes < bsize" before so it can't happen here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Acked-by: Johannes Götzfried <johannes.goetzfried@cs.fau.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-02-27 05:56:54 +08:00
Tim Chen 79ba451d66 crypto: aesni - fix build on x86 (32bit)
We rename aesni-intel_avx.S to aesni-intel_avx-x86_64.S to indicate
that it is only used by x86_64 architecture.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-01-15 11:36:34 +08:00
Andy Shevchenko 8610d7bf60 crypto: aesni - fix build on x86 (32bit)
It seems commit d764593a "crypto: aesni - AVX and AVX2 version of AESNI-GCM
encode and decode" breaks a build on x86_32 since it's designed only for
x86_64. This patch makes a compilation unit conditional to CONFIG_64BIT and
functions usage to CONFIG_X86_64.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-12-31 19:47:46 +08:00
Tim Chen d764593af9 crypto: aesni - AVX and AVX2 version of AESNI-GCM encode and decode
We have added AVX and AVX2 routines that optimize AESNI-GCM encode/decode.
These routines are optimized for encrypt and decrypt of large buffers.
In tests we have seen up to 6% speedup for 1K, 11% speedup for 2K and
18% speedup for 8K buffer over the existing SSE version.  These routines
should provide even better speedup for future Intel x86_64 cpus.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-12-20 20:06:24 +08:00
Daniel Borkmann fed286110f crypto: arch - use crypto_memneq instead of memcmp
Replace remaining occurences (just as we did in crypto/) under arch/*/crypto/
that make use of memcmp() for comparing keys or authentication tags for
usage with crypto_memneq(). It can simply be used as a drop-in replacement
for the normal memcmp().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: James Yonan <james@openvpn.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-12-20 20:06:24 +08:00
Oliver Neukum 16c0c4e165 crypto: sha256_ssse3 - also test for BMI2
The AVX2 implementation also uses BMI2 instructions,
but doesn't test for their availability. The assumption
that AVX2 and BMI2 always go together is false. Some
Haswells have AVX2 but not BMI2.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-10-07 14:17:10 +08:00
Ard Biesheuvel 801201aa25 crypto: move x86 to the generic version of ablk_helper
Move all users of ablk_helper under x86/ to the generic version
and delete the x86 specific version.

Acked-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-24 06:02:24 +10:00
Jussi Kivilinna 58497204aa crypto: x86 - restore avx2_supported check
Commit 3d387ef08c (Revert "crypto: blowfish - add AVX2/x86_64 implementation
of blowfish cipher") reverted too much as it removed the 'assembler supports
AVX2' check and therefore disabled remaining AVX2 implementations of Camellia
and Serpent. Patch restores the check and enables these implementations.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-13 21:43:52 +10:00
Jussi Kivilinna 7d444909a2 crypto: sha256_ssse3 - use correct module alias for sha224
Commit a710f761f (crypto: sha256_ssse3 - add sha224 support) attempted to add
MODULE_ALIAS for SHA-224, but it ended up being "sha384", probably because
mix-up with previous commit 340991e30 (crypto: sha512_ssse3 - add sha384
support). Patch corrects module alias to "sha224".

Reported-by: Pierre-Mayeul Badaire <pierre-mayeul.badaire@m4x.org>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-13 21:43:52 +10:00
Herbert Xu 68411521cc Reinstate "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
This patch reinstates commits
	67822649d7
	39761214ee
	0b95a7f857
	31d939625a
	2d31e518a4

Now that module softdeps are in the kernel we can use that to resolve
the boot issue which cause the revert.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-07 12:56:26 +10:00
Herbert Xu eeca9fad52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
Merge upstream tree in order to reinstate crct10dif.
2013-09-07 12:53:35 +10:00
Julia Lawall 2a128b4b74 crypto: camellia-x86-64 - replace commas by semicolons and adjust code alignment
Adjust alignment and replace commas by semicolons in automatically
generated code.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-08-21 21:08:32 +10:00
Andi Kleen f22d08111a crypto: make tables used from assembler __visible
Tables used from assembler should be marked __visible to let
the compiler know.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-08-14 20:42:03 +10:00
Linus Torvalds b48a97be8e Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This push fixes a memory corruption issue in caam, as well as
  reverting the new optimised crct10dif implementation as it breaks boot
  on initrd systems.

  Hopefully crct10dif will be reinstated once the supporting code is
  added so that it doesn't break boot"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  Revert "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
  crypto: caam - Fixed the memory out of bound overwrite issue
2013-07-24 11:05:18 -07:00
Herbert Xu e70308ec0e Revert "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
This reverts commits
    67822649d7
    39761214ee
    0b95a7f857
    31d939625a
    2d31e518a4

Unfortunately this change broke boot on some systems that used an
initrd which does not include the newly created crct10dif modules.
As these modules are required by sd_mod under certain configurations
this is a serious problem.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-07-24 17:04:16 +10:00
Linus Torvalds b2c311075d Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 - Do not idle omap device between crypto operations in one session.
 - Added sha224/sha384 shims for SSSE3.
 - More optimisations for camellia-aesni-avx2.
 - Removed defunct blowfish/twofish AVX2 implementations.
 - Added unaligned buffer self-tests.
 - Added PCLMULQDQ optimisation for CRCT10DIF.
 - Added support for Freescale's DCP co-processor
 - Misc fixes.

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (44 commits)
  crypto: testmgr - test hash implementations with unaligned buffers
  crypto: testmgr - test AEADs with unaligned buffers
  crypto: testmgr - test skciphers with unaligned buffers
  crypto: testmgr - check that entries in alg_test_descs are in correct order
  Revert "crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher"
  Revert "crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher"
  crypto: camellia-aesni-avx2 - tune assembly code for more performance
  hwrng: bcm2835 - fix MODULE_LICENSE tag
  hwrng: nomadik - use clk_prepare_enable()
  crypto: picoxcell - replace strict_strtoul() with kstrtoul()
  crypto: dcp - Staticize local symbols
  crypto: dcp - Use NULL instead of 0
  crypto: dcp - Use devm_* APIs
  crypto: dcp - Remove redundant platform_set_drvdata()
  hwrng: use platform_{get,set}_drvdata()
  crypto: omap-aes - Don't idle/start AES device between Encrypt operations
  crypto: crct10dif - Use PTR_RET
  crypto: ux500 - Cocci spatch "resource_size.spatch"
  crypto: sha256_ssse3 - add sha224 support
  crypto: sha512_ssse3 - add sha384 support
  ...
2013-07-05 12:12:33 -07:00
Linus Torvalds 92616ee654 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes an unaligned crash in XTS mode when using aseni_intel"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: aesni_intel - fix accessing of unaligned memory
2013-06-21 06:28:39 -10:00
Herbert Xu 02c0241b60 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto
Merge crypto to resolve conflict in crypto/Kconfig.
2013-06-21 15:13:27 +08:00
Jussi Kivilinna 99f42f937a Revert "crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher"
This reverts commit cf1521a1a5.

Instruction (vpgatherdd) that this implementation relied on turned out to be
slow performer on real hardware (i5-4570). The previous 8-way twofish/AVX
implementation is therefore faster and this implementation should be removed.

Converting this implementation to use the same method as in twofish/AVX for
table look-ups would give additional ~3% speed up vs twofish/AVX, but would
hardly be worth of the added code and binary size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-21 14:44:29 +08:00
Jussi Kivilinna 3d387ef08c Revert "crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher"
This reverts commit 6048801070.

Instruction (vpgatherdd) that this implementation relied on turned out to be
slow performer on real hardware (i5-4570). The previous 4-way blowfish
implementation is therefore faster and this implementation should be removed.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-21 14:44:28 +08:00
Jussi Kivilinna acfffdb803 crypto: camellia-aesni-avx2 - tune assembly code for more performance
Add implementation tuned for more performance on real hardware. Changes are
mostly around the part mixing 128-bit extract and insert instructions and
AES-NI instructions. Also 'vpbroadcastb' instructions have been change to
'vpshufb with zero mask'.

Tests on Intel Core i5-4570:

tcrypt ECB results, old-AVX2 vs new-AVX2:

size    128bit key      256bit key
        enc     dec     enc     dec
256     1.00x   1.00x   1.00x   1.00x
1k      1.08x   1.09x   1.05x   1.06x
8k      1.06x   1.06x   1.06x   1.06x

tcrypt ECB results, AVX vs new-AVX2:

size    128bit key      256bit key
        enc     dec     enc     dec
256     1.00x   1.00x   1.00x   1.00x
1k      1.51x   1.50x   1.52x   1.50x
8k      1.47x   1.48x   1.48x   1.48x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-21 14:44:23 +08:00
Jussi Kivilinna fe6510b5d6 crypto: aesni_intel - fix accessing of unaligned memory
The new XTS code for aesni_intel uses input buffers directly as memory operands
for pxor instructions, which causes crash if those buffers are not aligned to
16 bytes.

Patch changes XTS code to handle unaligned memory correctly, by loading memory
with movdqu instead.

Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-13 14:57:42 +08:00
Linus Torvalds 484b002e28 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:

 - Three EFI-related fixes

 - Two early memory initialization fixes

 - build fix for older binutils

 - fix for an eager FPU performance regression -- currently we don't
   allow the use of the FPU at interrupt time *at all* in eager mode,
   which is clearly wrong.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Allow FPU to be used at interrupt time even with eagerfpu
  x86, crc32-pclmul: Fix build with older binutils
  x86-64, init: Fix a possible wraparound bug in switchover in head_64.S
  x86, range: fix missing merge during add range
  x86, efi: initial the local variable of DataSize to zero
  efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse
  efivarfs: Never return ENOENT from firmware again
2013-05-31 09:44:10 +09:00
Jan Beulich 2baad6121e x86, crc32-pclmul: Fix build with older binutils
binutils prior to 2.18 (e.g. the ones found on SLE10) don't support
assembling PEXTRD, so a macro based approach like the one for PCLMULQDQ
in the same file should be used.

This requires making the helper macros capable of recognizing 32-bit
general purpose register operands.

[ hpa: tagging for stable as it is a low risk build fix ]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/51A6142A02000078000D99D8@nat28.tlf.novell.com
Cc: Alexander Boyko <alexander_boyko@xyratex.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-30 16:36:23 -07:00
Jussi Kivilinna a710f761fc crypto: sha256_ssse3 - add sha224 support
Add sha224 implementation to sha256_ssse3 module.

This also fixes sha256_ssse3 module autoloading issue when 'sha224' is used
before 'sha256'. Previously in such case, just sha256_generic was loaded and
not sha256_ssse3 (since it did not provide sha224). Now if 'sha256' was used
after 'sha224' usage, sha256_ssse3 would remain unloaded.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-28 15:43:05 +08:00
Jussi Kivilinna 340991e30c crypto: sha512_ssse3 - add sha384 support
Add sha384 implementation to sha512_ssse3 module.

This also fixes sha512_ssse3 module autoloading issue when 'sha384' is used
before 'sha512'. Previously in such case, just sha512_generic was loaded and
not sha512_ssse3 (since it did not provide sha384). Now if 'sha512' was used
after 'sha384' usage, sha512_ssse3 would remain unloaded. For example, this
happens with tcrypt testing module since it tests 'sha384' before 'sha512'.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-28 15:43:05 +08:00
Jussi Kivilinna de614e561b crypto: sha256_ssse3 - fix stack corruption with SSSE3 and AVX implementations
The _XFER stack element size was set too small, 8 bytes, when it needs to be
16 bytes. As _XFER is the last stack element used by these implementations,
the 16 byte stores with 'movdqa' corrupt the stack where the value of register
%r12 is temporarily stored. As these implementations align the stack pointer
to 16 bytes, this corruption did not happen every time.

Patch corrects this issue.

Reported-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Tested-by: Julian Wollrath <jwollrath@web.de>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-28 13:46:47 +08:00
Tim Chen 0b95a7f857 crypto: crct10dif - Glue code to cast accelerated CRCT10DIF assembly as a crypto transform
Glue code that plugs the PCLMULQDQ accelerated CRC T10 DIF hash into the
crypto framework.  The config CRYPTO_CRCT10DIF_PCLMUL should be turned
on to enable the feature.  The crc_t10dif crypto library function will
use this faster algorithm when crct10dif_pclmul module is loaded.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-24 17:55:27 +08:00
Tim Chen 31d939625a crypto: crct10dif - Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
instructions.  Details discussing the implementation can be found in the
paper:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-20 20:11:06 +08:00
Jussi Kivilinna f3f935a76a crypto: camellia - add AVX2/AES-NI/x86_64 assembler implementation of camellia cipher
Patch adds AVX2/AES-NI/x86-64 implementation of Camellia cipher, requiring
32 parallel blocks for input (512 bytes). Compared to AVX implementation, this
version is extended to use the 256-bit wide YMM registers. For AES-NI
instructions data is split to two 128-bit registers and merged afterwards.
Even with this additional handling, performance should be higher compared
to the AES-NI/AVX implementation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:07 +08:00
Jussi Kivilinna 56d76c96a9 crypto: serpent - add AVX2/x86_64 assembler implementation of serpent cipher
Patch adds AVX2/x86-64 implementation of Serpent cipher, requiring 16 parallel
blocks for input (256 bytes). Implementation is based on the AVX implementation
and extends to use the 256-bit wide YMM registers. Since serpent does not use
table look-ups, this implementation should be close to two times faster than
the AVX implementation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:07 +08:00
Jussi Kivilinna cf1521a1a5 crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher
Patch adds AVX2/x86-64 implementation of Twofish cipher, requiring 16 parallel
blocks for input (256 bytes). Table look-ups are performed using vpgatherdd
instruction directly from vector registers and thus should be faster than
earlier implementations. Implementation also uses 256-bit wide YMM registers,
which should give additional speed up compared to the AVX implementation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:05 +08:00
Jussi Kivilinna 6048801070 crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher
Patch adds AVX2/x86-64 implementation of Blowfish cipher, requiring 32 parallel
blocks for input (256 bytes). Table look-ups are performed using vpgatherdd
instruction directly from vector registers and thus should be faster than
earlier implementations.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:04 +08:00
Jussi Kivilinna c456a9cd1a crypto: aesni_intel - add more optimized XTS mode for x86-64
Add more optimized XTS code for aesni_intel in 64-bit mode, for smaller stack
usage and boost for speed.

tcrypt results, with Intel i5-2450M:
256-bit key
        enc     dec
16B     0.98x   0.99x
64B     0.64x   0.63x
256B    1.29x   1.32x
1024B   1.54x   1.58x
8192B   1.57x   1.60x

512-bit key
        enc     dec
16B     0.98x   0.99x
64B     0.60x   0.59x
256B    1.24x   1.25x
1024B   1.39x   1.42x
8192B   1.38x   1.42x

I chose not to optimize smaller than block size of 256 bytes, since XTS is
practically always used with data blocks of size 512 bytes. This is why
performance is reduced in tcrypt for 64 byte long blocks.

Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:53 +08:00
Jussi Kivilinna b5c5b072dc crypto: x86/camellia-aesni-avx - add more optimized XTS code
Add more optimized XTS code for camellia-aesni-avx, for smaller stack usage
and small boost for speed.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.10x   1.01x
64B     0.82x   0.77x
256B    1.14x   1.10x
1024B   1.17x   1.16x
8192B   1.10x   1.11x

Since XTS is practically always used with data blocks of size 512 bytes or
more, I chose to not make use of camellia-2way for block sized smaller than
256 bytes. This causes slower result in tcrypt for 64 bytes.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:52 +08:00
Jussi Kivilinna 70177286e1 crypto: cast6-avx: use new optimized XTS code
Change cast6-avx to use the new XTS code, for smaller stack usage and small
boost to performance.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.01x   1.01x
64B     1.01x   1.00x
256B    1.09x   1.02x
1024B   1.08x   1.06x
8192B   1.08x   1.07x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:52 +08:00
Jussi Kivilinna 18be45270a crypto: x86/twofish-avx - use optimized XTS code
Change twofish-avx to use the new XTS code, for smaller stack usage and small
boost to performance.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.03x   1.02x
64B     0.91x   0.91x
256B    1.10x   1.09x
1024B   1.12x   1.11x
8192B   1.12x   1.11x

Since XTS is practically always used with data blocks of size 512 bytes or
more, I chose to not make use of twofish-3way for block sized smaller than
128 bytes. This causes slower result in tcrypt for 64 bytes.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:51 +08:00
Jussi Kivilinna a05248ed2d crypto: x86 - add more optimized XTS-mode for serpent-avx
This patch adds AVX optimized XTS-mode helper functions/macros and converts
serpent-avx to use the new facilities. Benefits are slightly improved speed
and reduced stack usage as use of temporary IV-array is avoided.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.00x   1.00x
64B     1.00x   1.00x
256B    1.04x   1.06x
1024B   1.09x   1.09x
8192B   1.10x   1.09x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:51 +08:00
Sandy Wu 57ae1b0532 crypto: crc32-pclmul - Use gas macro for pclmulqdq
Occurs when CONFIG_CRYPTO_CRC32C_INTEL=y and CONFIG_CRYPTO_CRC32C_INTEL=y.
Older versions of bintuils do not support the pclmulqdq instruction. The
PCLMULQDQ gas macro is used instead.

Signed-off-by: Sandy Wu <sandyw@twitter.com>
Cc: stable@vger.kernel.org # 3.8+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:44 +08:00
Tim Chen 87de4579f9 crypto: sha512 - Create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions.
We added glue code and config options to create crypto
module that uses SSE/AVX/AVX2 optimized SHA512 x86_64 assembly routines.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:42 +08:00
Tim Chen 5663535b69 crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX2 RORX instruction.
Provides SHA512 x86_64 assembly routine optimized with SSE, AVX and
AVX2's RORX instructions.  Speedup of 70% or more has been
measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:00:58 +08:00
Tim Chen e01d69cb01 crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX instructions.
Provides SHA512 x86_64 assembly routine optimized with SSE and AVX instructions.
Speedup of 60% or more has been measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:00:58 +08:00
Tim Chen bf215cee23 crypto: sha512 - Optimized SHA512 x86_64 assembly routine using Supplemental SSE3 instructions.
Provides SHA512 x86_64 assembly routine optimized with SSSE3 instructions.
Speedup of 40% or more has been measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:00:58 +08:00
Tim Chen 8275d1aa64 crypto: sha256 - Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.
We added glue code and config options to create crypto
module that uses SSE/AVX/AVX2 optimized SHA256 x86_64 assembly routines.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:00:57 +08:00
Tim Chen d34a460092 crypto: sha256 - Optimized sha256 x86_64 routine using AVX2's RORX instructions
Provides SHA256 x86_64 assembly routine optimized with SSE, AVX and
AVX2's RORX instructions.  Speedup of 70% or more has been
measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-03 09:06:32 +08:00
Tim Chen ec2b4c851f crypto: sha256 - Optimized sha256 x86_64 assembly routine with AVX instructions.
Provides SHA256 x86_64 assembly routine optimized with SSE and AVX instructions.
Speedup of 60% or more has been measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-03 09:06:32 +08:00
Tim Chen 46d208a2bd crypto: sha256 - Optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions.
Provides SHA256 x86_64 assembly routine optimized with SSSE3 instructions.
Speedup of 40% or more has been measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-03 09:06:31 +08:00
Jussi Kivilinna 873b9cafa8 crypto: x86 - build AVX block cipher implementations only if assembler supports AVX instructions
These modules require AVX support in assembler, so add new check to Makefile
for this.

Other option would be to use CONFIG_AS_AVX inside source files, but that would
result dummy/empty/no-fuctionality modules being created.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-03 09:06:30 +08:00
Jussi Kivilinna eca1726997 crypto: x86/crc32-pclmul - assembly clean-ups: use ENTRY/ENDPROC
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-03 09:06:29 +08:00
Tim Chen 918731fa28 crypto: crc32c - Update the links to the white papers on CRC32C calculations with PCLMULQDQ instructions.
Herbert,

The following patch update the stale link to the CRC32C white paper
that was referenced.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-03-10 16:46:43 +08:00
Herbert Xu ca81a1a1b8 crypto: crc32c - Kill pointless CRYPTO_CRC32C_X86_64 option
This bool option can never be set to anything other than y.  So
let's just kill it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-02-26 17:52:15 +08:00
Linus Torvalds 32dc43e40a Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "Here is the crypto update for 3.9:

   - Added accelerated implementation of crc32 using pclmulqdq.

   - Added test vector for fcrypt.

   - Added support for OMAP4/AM33XX cipher and hash.

   - Fixed loose crypto_user input checks.

   - Misc fixes"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (43 commits)
  crypto: user - ensure user supplied strings are nul-terminated
  crypto: user - fix empty string test in report API
  crypto: user - fix info leaks in report API
  crypto: caam - Added property fsl,sec-era in SEC4.0 device tree binding.
  crypto: use ERR_CAST
  crypto: atmel-aes - adjust duplicate test
  crypto: crc32-pclmul - Kill warning on x86-32
  crypto: x86/twofish - assembler clean-ups: use ENTRY/ENDPROC, localize jump labels
  crypto: x86/sha1 - assembler clean-ups: use ENTRY/ENDPROC
  crypto: x86/serpent - use ENTRY/ENDPROC for assember functions and localize jump targets
  crypto: x86/salsa20 - assembler cleanup, use ENTRY/ENDPROC for assember functions and rename ECRYPT_* to salsa20_*
  crypto: x86/ghash - assembler clean-up: use ENDPROC at end of assember functions
  crypto: x86/crc32c - assembler clean-up: use ENTRY/ENDPROC
  crypto: cast6-avx: use ENTRY()/ENDPROC() for assembler functions
  crypto: cast5-avx: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
  crypto: camellia-x86_64/aes-ni: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
  crypto: blowfish-x86_64: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
  crypto: aesni-intel - add ENDPROC statements for assembler functions
  crypto: x86/aes - assembler clean-ups: use ENTRY/ENDPROC, localize jump targets
  crypto: testmgr - add test vector for fcrypt
  ...
2013-02-25 15:56:15 -08:00
Herbert Xu 7983627657 crypto: crc32-pclmul - Kill warning on x86-32
This patch removes a gratuitous warning on x86-32:

arch/x86/crypto/crc32-pclmul_asm.S:87:2: warning: #warning Using 32bit code support [-Wcpp]

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 18:05:02 +11:00
Jussi Kivilinna d3f5188dfe crypto: x86/twofish - assembler clean-ups: use ENTRY/ENDPROC, localize jump labels
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:51 +11:00
Jussi Kivilinna ac9d55dd42 crypto: x86/sha1 - assembler clean-ups: use ENTRY/ENDPROC
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:51 +11:00
Jussi Kivilinna 2dcfd44dee crypto: x86/serpent - use ENTRY/ENDPROC for assember functions and localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:50 +11:00
Jussi Kivilinna 044438082c crypto: x86/salsa20 - assembler cleanup, use ENTRY/ENDPROC for assember functions and rename ECRYPT_* to salsa20_*
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:50 +11:00
Jussi Kivilinna b05d3f3756 crypto: x86/ghash - assembler clean-up: use ENDPROC at end of assember functions
Signed-off-by: Jussi Kivilinna <jussi.kivilinn@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:49 +11:00
Jussi Kivilinna 698a5abbb0 crypto: x86/crc32c - assembler clean-up: use ENTRY/ENDPROC
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:49 +11:00
Jussi Kivilinna 1985fecf01 crypto: cast6-avx: use ENTRY()/ENDPROC() for assembler functions
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:49 +11:00
Jussi Kivilinna e17e209ea4 crypto: cast5-avx: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:48 +11:00
Jussi Kivilinna 59990684b0 crypto: camellia-x86_64/aes-ni: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:48 +11:00
Jussi Kivilinna 5186e395fe crypto: blowfish-x86_64: use ENTRY()/ENDPROC() for assembler functions and localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:48 +11:00
Jussi Kivilinna 8309b745bb crypto: aesni-intel - add ENDPROC statements for assembler functions
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:47 +11:00
Jussi Kivilinna 3f29974383 crypto: x86/aes - assembler clean-ups: use ENTRY/ENDPROC, localize jump targets
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:47 +11:00
Alexander Boyko 78c37d191d crypto: crc32 - add crc32 pclmulqdq implementation and wrappers for table implementation
This patch adds crc32 algorithms to shash crypto api. One is wrapper to
gerneric crc32_le function. Second is crc32 pclmulqdq implementation. It
use hardware provided PCLMULQDQ instruction to accelerate the CRC32 disposal.
This instruction present from Intel Westmere and AMD Bulldozer CPUs.

For intel core i5 I got 450MB/s for table implementation and 2100MB/s
for pclmulqdq implementation.

Signed-off-by: Alexander Boyko <alexander_boyko@xyratex.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-01-20 10:16:45 +11:00
Jussi Kivilinna 0024dc5371 crypto: aesni-intel - remove rfc3686(ctr(aes)), utilize rfc3686 from ctr-module instead
rfc3686 in CTR module is now able of using asynchronous ctr(aes) from
aesni-intel, so rfc3686(ctr(aes)) in aesni-intel is no longer needed.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-01-08 07:04:47 +01:00
Linus Torvalds 1ed55eac3b Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:

 - Added aesni/avx/x86_64 implementations for camellia.

 - Optimised AVX code for cast5/serpent/twofish/cast6.

 - Fixed vmac bug with unaligned input.

 - Allow compression algorithms in FIPS mode.

 - Optimised crc32c implementation for Intel.

 - Misc fixes.

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (32 commits)
  crypto: caam - Updated SEC-4.0 device tree binding for ERA information.
  crypto: testmgr - remove superfluous initializers for xts(aes)
  crypto: testmgr - allow compression algs in fips mode
  crypto: testmgr - add larger crc32c test vector to test FPU path in crc32c_intel
  crypto: testmgr - clean alg_test_null entries in alg_test_descs[]
  crypto: testmgr - remove fips_allowed flag from camellia-aesni null-tests
  crypto: cast5/cast6 - move lookup tables to shared module
  padata: use __this_cpu_read per-cpu helper
  crypto: s5p-sss - Fix compilation error
  crypto: picoxcell - Add terminating entry for platform_device_id table
  crypto: omap-aes - select BLKCIPHER2
  crypto: camellia - add AES-NI/AVX/x86_64 assembler implementation of camellia cipher
  crypto: camellia-x86_64 - share common functions and move structures and function definitions to header file
  crypto: tcrypt - add async speed test for camellia cipher
  crypto: tegra-aes - fix error-valued pointer dereference
  crypto: tegra - fix missing unlock on error case
  crypto: cast5/avx - avoid using temporary stack buffers
  crypto: serpent/avx - avoid using temporary stack buffers
  crypto: twofish/avx - avoid using temporary stack buffers
  crypto: cast6/avx - avoid using temporary stack buffers
  ...
2012-12-15 12:35:19 -08:00
Jussi Kivilinna 044ab52578 crypto: cast5/cast6 - move lookup tables to shared module
CAST5 and CAST6 both use same lookup tables, which can be moved shared module
'cast_common'.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-12-06 17:16:26 +08:00
Jussi Kivilinna d9b1d2e7e1 crypto: camellia - add AES-NI/AVX/x86_64 assembler implementation of camellia cipher
This patch adds AES-NI/AVX/x86_64 assembler implementation of Camellia block
cipher. Implementation process data in sixteen block chunks, which are
byte-sliced and AES SubBytes is reused for Camellia s-box with help of pre-
and post-filtering.

Patch has been tested with tcrypt and automated filesystem tests.

tcrypt test results:

Intel Core i5-2450M:

camellia-aesni-avx vs camellia-asm-x86_64-2way:
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.98x   0.96x   0.99x   0.96x   0.96x   0.95x   0.95x   0.94x   0.97x   0.98x
64B     0.99x   0.98x   1.00x   0.98x   0.98x   0.99x   0.98x   0.93x   0.99x   0.98x
256B    2.28x   2.28x   1.01x   2.29x   2.25x   2.24x   1.96x   1.97x   1.91x   1.90x
1024B   2.57x   2.56x   1.00x   2.57x   2.51x   2.53x   2.19x   2.17x   2.19x   2.22x
8192B   2.49x   2.49x   1.00x   2.53x   2.48x   2.49x   2.17x   2.17x   2.22x   2.22x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.97x   0.98x   0.99x   0.97x   0.97x   0.96x   0.97x   0.98x   0.98x   0.99x
64B     1.00x   1.00x   1.01x   0.99x   0.98x   0.99x   0.99x   0.99x   0.99x   0.99x
256B    2.37x   2.37x   1.01x   2.39x   2.35x   2.33x   2.10x   2.11x   1.99x   2.02x
1024B   2.58x   2.60x   1.00x   2.58x   2.56x   2.56x   2.28x   2.29x   2.28x   2.29x
8192B   2.50x   2.52x   1.00x   2.56x   2.51x   2.51x   2.24x   2.25x   2.26x   2.29x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-11-09 17:32:32 +08:00
Jussi Kivilinna cf582cceda crypto: camellia-x86_64 - share common functions and move structures and function definitions to header file
Prepare camellia-x86_64 functions to be reused from AVX/AESNI implementation
module.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-11-09 17:32:31 +08:00
Jussi Kivilinna c12ab20b16 crypto: cast5/avx - avoid using temporary stack buffers
Introduce new assembler functions to avoid use temporary stack buffers in glue
code. This also allows use of vector instructions for xoring output in CTR and
CBC modes and construction of IVs for CTR mode.

ECB mode sees ~0.5% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~5%.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-24 21:10:55 +08:00
Jussi Kivilinna facd416fbc crypto: serpent/avx - avoid using temporary stack buffers
Introduce new assembler functions to avoid use temporary stack buffers in glue
code. This also allows use of vector instructions for xoring output in CTR and
CBC modes and construction of IVs for CTR mode.

ECB mode sees ~0.5% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~3%.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-24 21:10:55 +08:00
Jussi Kivilinna 8435a3c300 crypto: twofish/avx - avoid using temporary stack buffers
Introduce new assembler functions to avoid use temporary stack buffers in glue
code. This also allows use of vector instructions for xoring output in CTR and
CBC modes and construction of IVs for CTR mode.

ECB mode sees ~0.2% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~3%.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-24 21:10:55 +08:00
Jussi Kivilinna cba1cce054 crypto: cast6/avx - avoid using temporary stack buffers
Introduce new assembler functions to avoid use temporary stack buffers in
glue code. This also allows use of vector instructions for xoring output
in CTR and CBC modes and construction of IVs for CTR mode.

ECB mode sees ~0.5% decrease in speed because added one extra function
call. CBC mode decryption and CTR mode benefit from vector operations
and gain ~2%.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-24 21:10:54 +08:00
Jussi Kivilinna 58990986f1 crypto: x86/glue_helper - use le128 instead of u128 for CTR mode
'u128' currently used for CTR mode is on little-endian 'long long' swapped
and would require extra swap operations by SSE/AVX code. Use of le128
instead of u128 allows IV calculations to be done with vector registers
easier.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-24 21:10:54 +08:00
Jussi Kivilinna 32bec973a8 crypto: aesni - fix XTS mode on x86-32, add wrapper function for asmlinkage aesni_enc()
Calling convention for internal functions and 'asmlinkage' functions is
different on x86-32. Therefore do not directly cast aesni_enc as XTS tweak
function, but use wrapper function in between. Fixes crash with "XTS +
aesni_intel + x86-32" combination.

Cc: stable@vger.kernel.org
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-18 14:01:33 -07:00
Tim Chen 6a8ce1ef39 crypto: crc32c - Optimize CRC32C calculation with PCLMULQDQ instruction
This patch adds the crc_pcl function that calculates CRC32C checksum using the
PCLMULQDQ instruction on processors that support this feature. This will
provide speedup over using CRC32 instruction only.
The usage of PCLMULQDQ necessitate the invocation of kernel_fpu_begin and
kernel_fpu_end and incur some overhead.  So the new crc_pcl function is only
invoked for buffer size of 512 bytes or more.  Larger sized
buffers will expect to see greater speedup.  This feature is best used coupled
with eager_fpu which reduces the kernel_fpu_begin/end overhead.  For
buffer size of 1K the speedup is around 1.6x and for buffer size greater than
4K, the speedup is around 3x compared to original implementation in crc32c-intel
module. Test was performed on Sandy Bridge based platform with constant frequency
set for cpu.

A white paper detailing the algorithm can be found here:
http://download.intel.com/design/intarch/papers/323405.pdf

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-15 22:18:24 +08:00
Tim Chen 35b80920d4 crypto: crc32c - Rename crc32c-intel.c to crc32c-intel_glue.c
This patch renames the crc32c-intel.c file to crc32c-intel_glue.c file
in preparation for linking with the new crc32c-pcl-intel-asm.S file,
which contains optimized crc32c calculation based on PCLMULQDQ
instruction.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-15 22:18:22 +08:00
Jussi Kivilinna c9f97a27ce crypto: x86/glue_helper - fix storing of new IV in CBC encryption
Glue_helper incorrectly XORs new IV over old IV at end of CBC encryption
function when it should store. This causes CBC encryption to give
incorrect output on multi-page encryption requests.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-10-04 17:45:29 +08:00
Jussi Kivilinna 200429cc63 crypto: cast5/avx - fix storing of new IV in CBC encryption
cast5/avx incorrectly XORs new IV over old IV at end of CBC encryption
function when it should store. This causes CBC encryption to give
incorrect output on multi-page encryption requests.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-09-27 15:50:40 +08:00
Jussi Kivilinna 1ffb72a39a crypto: camellia-x86_64 - fix sparse warnings (constant is so big)
Fix "constant 0xXXXXXXXXXXXXXXXX is so big it's unsigned long" sparse warnings.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-09-07 04:17:05 +08:00
Jussi Kivilinna c09220e1bc crypto: cast6-avx - tune assembler code for more performance
Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies, interleaves instructions better for out-of-order scheduling
and merges constant 16-bit rotation with round-key variable rotation.

tcrypt ECB results:

Intel Core i5-2450M:

size    old-vs-new      new-vs-generic  old-vs-generic
        enc     dec     enc     dec     enc     dec
256     1.13x   1.19x   2.05x   2.17x   1.82x   1.82x
1k      1.18x   1.21x   2.26x   2.33x   1.93x   1.93x
8k      1.19x   1.19x   2.32x   2.33x   1.95x   1.95x

[v2]
 - Do instruction interleaving another way to avoid adding new FPU<=>CPU
   register moves as these cause performance drop on Bulldozer.
 - Improvements to round-key variable rotation handling.
 - Further interleaving improvements for better out-of-order scheduling.

Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-09-07 04:17:05 +08:00
Jussi Kivilinna ddaea7869d crypto: cast5-avx - tune assembler code for more performance
Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies, interleaves instructions better for out-of-order scheduling
and merges constant 16-bit rotation with round-key variable rotation.

tcrypt ECB results (128bit key):

Intel Core i5-2450M:

size    old-vs-new      new-vs-generic  old-vs-generic
        enc     dec     enc     dec     enc     dec
256     1.18x   1.18x   2.45x   2.47x   2.08x   2.10x
1k      1.20x   1.20x   2.73x   2.73x   2.28x   2.28x
8k      1.20x   1.19x   2.73x   2.73x   2.28x   2.29x

[v2]
 - Do instruction interleaving another way to avoid adding new FPU<=>CPU
   register moves as these cause performance drop on Bulldozer.
 - Improvements to round-key variable rotation handling.
 - Further interleaving improvements for better out-of-order scheduling.

Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-09-07 04:17:04 +08:00
Jussi Kivilinna f94a73f8dd crypto: twofish-avx - tune assembler code for more performance
Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies and interleaves instructions better for out-of-order scheduling.

Tested on Intel Core i5-2450M and AMD FX-8100.

tcrypt ECB results:

Intel Core i5-2450M:

size    old-vs-new      new-vs-3way     old-vs-3way
        enc     dec     enc     dec     enc     dec
256     1.12x   1.13x   1.36x   1.37x   1.21x   1.22x
1k      1.14x   1.14x   1.48x   1.49x   1.29x   1.31x
8k      1.14x   1.14x   1.50x   1.52x   1.32x   1.33x

AMD FX-8100:

size    old-vs-new      new-vs-3way     old-vs-3way
        enc     dec     enc     dec     enc     dec
256     1.10x   1.11x   1.01x   1.01x   0.92x   0.91x
1k      1.11x   1.12x   1.08x   1.07x   0.97x   0.96x
8k      1.11x   1.13x   1.10x   1.08x   0.99x   0.97x

[v2]
 - Do instruction interleaving another way to avoid adding new FPU<=>CPU
   register moves as these cause performance drop on Bulldozer.
 - Further interleaving improvements for better out-of-order scheduling.

Tested-by: Borislav Petkov <bp@alien8.de>
Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-09-07 04:17:04 +08:00
Jussi Kivilinna 023af60825 crypto: aesni_intel - improve lrw and xts performance by utilizing parallel AES-NI hardware pipelines
Use parallel LRW and XTS encryption facilities to better utilize AES-NI
hardware pipelines and gain extra performance.

Tcrypt benchmark results (async), old vs new ratios:

Intel Core i5-2450M CPU (fam: 6, model: 42, step: 7)

aes:128bit
        lrw:256bit      xts:256bit
size    lrw-enc lrw-dec xts-dec xts-dec
16B     0.99x   1.00x   1.22x   1.19x
64B     1.38x   1.50x   1.58x   1.61x
256B    2.04x   2.02x   2.27x   2.29x
1024B   2.56x   2.54x   2.89x   2.92x
8192B   2.85x   2.99x   3.40x   3.23x

aes:192bit
        lrw:320bit      xts:384bit
size    lrw-enc lrw-dec xts-dec xts-dec
16B     1.08x   1.08x   1.16x   1.17x
64B     1.48x   1.54x   1.59x   1.65x
256B    2.18x   2.17x   2.29x   2.28x
1024B   2.67x   2.67x   2.87x   3.05x
8192B   2.93x   2.84x   3.28x   3.33x

aes:256bit
        lrw:348bit      xts:512bit
size    lrw-enc lrw-dec xts-dec xts-dec
16B     1.07x   1.07x   1.18x   1.19x
64B     1.56x   1.56x   1.70x   1.71x
256B    2.22x   2.24x   2.46x   2.46x
1024B   2.76x   2.77x   3.13x   3.05x
8192B   2.99x   3.05x   3.40x   3.30x

Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Reviewed-by: Kim Phillips <kim.phillips@freescale.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-08-20 16:28:10 +08:00
Johannes Goetzfried 4ea1277d30 crypto: cast6 - add x86_64/avx assembler implementation
This patch adds a x86_64/avx assembler implementation of the Cast6 block
cipher. The implementation processes eight blocks in parallel (two 4 block
chunk AVX operations). The table-lookups are done in general-purpose registers.
For small blocksizes the functions from the generic module are called. A good
performance increase is provided for blocksizes greater or equal to 128B.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

Intel Core i5-2500 CPU (fam:6, model:42, step:7)

cast6-avx-x86_64 vs. cast6-generic
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.97x   1.00x   1.01x   1.01x   0.99x   0.97x   0.98x   1.01x   0.96x   0.98x
64B     0.98x   0.99x   1.02x   1.01x   0.99x   1.00x   1.01x   0.99x   1.00x   0.99x
256B    1.77x   1.84x   0.99x   1.85x   1.77x   1.77x   1.70x   1.74x   1.69x   1.72x
1024B   1.93x   1.95x   0.99x   1.96x   1.93x   1.93x   1.84x   1.85x   1.89x   1.87x
8192B   1.91x   1.95x   0.99x   1.97x   1.95x   1.91x   1.86x   1.87x   1.93x   1.90x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.97x   0.99x   1.02x   1.01x   0.98x   0.99x   1.00x   1.00x   0.98x   0.98x
64B     0.98x   0.99x   1.01x   1.00x   1.00x   1.00x   1.01x   1.01x   0.97x   1.00x
256B    1.77x   1.83x   1.00x   1.86x   1.79x   1.78x   1.70x   1.76x   1.71x   1.69x
1024B   1.92x   1.95x   0.99x   1.96x   1.93x   1.93x   1.83x   1.86x   1.89x   1.87x
8192B   1.94x   1.95x   0.99x   1.97x   1.95x   1.95x   1.87x   1.87x   1.93x   1.91x

Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-08-01 17:47:30 +08:00
Johannes Goetzfried 4d6d6a2c85 crypto: cast5 - add x86_64/avx assembler implementation
This patch adds a x86_64/avx assembler implementation of the Cast5 block
cipher. The implementation processes sixteen blocks in parallel (four 4 block
chunk AVX operations). The table-lookups are done in general-purpose registers.
For small blocksizes the functions from the generic module are called. A good
performance increase is provided for blocksizes greater or equal to 128B.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

Intel Core i5-2500 CPU (fam:6, model:42, step:7)

cast5-avx-x86_64 vs. cast5-generic
64bit key:
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16B     0.99x   0.99x   1.00x   1.00x   1.02x   1.01x
64B     1.00x   1.00x   0.98x   1.00x   1.01x   1.02x
256B    2.03x   2.01x   0.95x   2.11x   2.12x   2.13x
1024B   2.30x   2.24x   0.95x   2.29x   2.35x   2.35x
8192B   2.31x   2.27x   0.95x   2.31x   2.39x   2.39x

128bit key:
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16B     0.99x   0.99x   1.00x   1.00x   1.01x   1.01x
64B     1.00x   1.00x   0.98x   1.01x   1.02x   1.01x
256B    2.17x   2.13x   0.96x   2.19x   2.19x   2.19x
1024B   2.29x   2.32x   0.95x   2.34x   2.37x   2.38x
8192B   2.35x   2.32x   0.95x   2.35x   2.39x   2.39x

Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-08-01 17:47:30 +08:00
Jussi Kivilinna 7af6c24568 crypto: arch/x86 - cleanup - remove unneeded crypto_alg.cra_list initializations
Initialization of cra_list is currently mixed, most ciphers initialize this
field and most shashes do not. Initialization however is not needed at all
since cra_list is initialized/overwritten in __crypto_register_alg() with
list_add(). Therefore perform cleanup to remove all unneeded initializations
of this field in 'arch/x86/crypto/'.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-08-01 17:47:27 +08:00
Johannes Goetzfried a43478863b crypto: twofish-avx - remove useless instruction
The register %rdx is written, but never read till the end of the encryption
routine. Therefore let's delete the useless instruction.

Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-07-11 11:08:30 +08:00
Milan Broz bf084d8f6e crypto: aesni-intel - fix wrong kfree pointer
kfree(new_key_mem) in rfc4106_set_key() should be called on malloced pointer,
not on aligned one, otherwise it can cause invalid pointer on free.

(Seen at least once when running tcrypt tests with debug kernel.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-07-11 11:06:13 +08:00
Jussi Kivilinna 70ef2601fe crypto: move arch/x86/include/asm/aes.h to arch/x86/include/asm/crypto/
Move AES header to the new asm/crypto directory.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:03 +08:00
Jussi Kivilinna d4af0e9d6e crypto: move arch/x86/include/asm/serpent-{sse2|avx}.h to arch/x86/include/asm/crypto/
Move serpent crypto headers to the new asm/crypto/ directory.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:02 +08:00
Jussi Kivilinna a7378d4e55 crypto: twofish-avx - remove duplicated glue code and use shared glue code from glue_helper
Now that shared glue code is available, convert twofish-avx to use it.

Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:02 +08:00
Jussi Kivilinna 414cb5e7cc crypto: twofish-x86_64-3way - remove duplicated glue code and use shared glue code from glue_helper
Now that shared glue code is available, convert twofish-x86_64-3way to use it.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:02 +08:00
Jussi Kivilinna 964263afdc crypto: camellia-x86_64 - remove duplicated glue code and use shared glue code from glue_helper
Now that shared glue code is available, convert camellia-x86_64 to use it.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:02 +08:00
Jussi Kivilinna 1d0debbd46 crypto: serpent-avx: remove duplicated glue code and use shared glue code from glue_helper
Now that shared glue code is available, convert serpent-avx to use it.

Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:01 +08:00
Jussi Kivilinna 596d875052 crypto: serpent-sse2 - split generic glue code to new helper module
Now that serpent-sse2 glue code has been made generic, it can be split to
separate module.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:01 +08:00
Jussi Kivilinna e81792fbc2 crypto: serpent-sse2 - prepare serpent-sse2 glue code into generic x86 glue code for 128bit block ciphers
Block cipher implementations in arch/x86/crypto/ contain common glue code that
is currently duplicated in each module (camellia-x86_64, twofish-x86_64-3way,
twofish-avx, serpent-sse2 and serpent-avx). This patch prepares serpent-sse2
glue into generic glue code for all 128bit block ciphers to use in
arch/x86/crypto.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:01 +08:00
Jussi Kivilinna a9629d7142 crypto: aes_ni - change to use shared ablk_* functions
Remove duplicate ablk_* functions and make use of ablk_helper module instead.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:01 +08:00
Jussi Kivilinna 30a0400882 crypto: twofish-avx - change to use shared ablk_* functions
Remove duplicate ablk_* functions and make use of ablk_helper module instead.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:01 +08:00
Jussi Kivilinna ffaf915632 crypto: ablk_helper - move ablk_* functions from serpent-sse2/avx glue code to shared module
Move ablk-* functions to separate module to share common code between cipher
implementations.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-27 14:42:00 +08:00
Jussi Kivilinna 3387e7d690 crypto: serpent-sse2/avx - allow both to be built into kernel
Rename serpent-avx assembler functions so that they do not collide with
serpent-sse2 assembler functions when linking both versions in to same
kernel image.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-14 10:09:03 +08:00
Johannes Goetzfried 7efe407672 crypto: serpent - add x86_64/avx assembler implementation
This patch adds a x86_64/avx assembler implementation of the Serpent block
cipher. The implementation is very similar to the sse2 implementation and
processes eight blocks in parallel. Because of the new non-destructive three
operand syntax all move-instructions can be removed and therefore a little
performance increase is provided.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

Intel Core i5-2500 CPU (fam:6, model:42, step:7)

serpent-avx-x86_64 vs. serpent-sse2-x86_64
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     1.03x   1.01x   1.01x   1.01x   1.00x   1.00x   1.00x   1.00x   1.00x   1.01x
64B     1.00x   1.00x   1.00x   1.00x   1.00x   0.99x   1.00x   1.01x   1.00x   1.00x
256B    1.05x   1.03x   1.00x   1.02x   1.05x   1.06x   1.05x   1.02x   1.05x   1.02x
1024B   1.05x   1.02x   1.00x   1.02x   1.05x   1.06x   1.05x   1.03x   1.05x   1.02x
8192B   1.05x   1.02x   1.00x   1.02x   1.06x   1.06x   1.04x   1.03x   1.04x   1.02x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     1.01x   1.00x   1.01x   1.01x   1.00x   1.00x   0.99x   1.03x   1.01x   1.01x
64B     1.00x   1.00x   1.00x   1.00x   1.00x   1.00x   1.00x   1.01x   1.00x   1.02x
256B    1.05x   1.02x   1.00x   1.02x   1.05x   1.02x   1.04x   1.05x   1.05x   1.02x
1024B   1.06x   1.02x   1.00x   1.02x   1.07x   1.06x   1.05x   1.04x   1.05x   1.02x
8192B   1.05x   1.02x   1.00x   1.02x   1.06x   1.06x   1.04x   1.05x   1.05x   1.02x

serpent-avx-x86_64 vs aes-asm (8kB block):
         128bit  256bit
ecb-enc  1.26x   1.73x
ecb-dec  1.20x   1.64x
cbc-enc  0.33x   0.45x
cbc-dec  1.24x   1.67x
ctr-enc  1.32x   1.76x
ctr-dec  1.32x   1.76x
lrw-enc  1.20x   1.60x
lrw-dec  1.15x   1.54x
xts-enc  1.22x   1.64x
xts-dec  1.17x   1.57x

Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-12 16:47:43 +08:00
Johannes Goetzfried 107778b592 crypto: twofish - add x86_64/avx assembler implementation
This patch adds a x86_64/avx assembler implementation of the Twofish block
cipher. The implementation processes eight blocks in parallel (two 4 block
chunk AVX operations). The table-lookups are done in general-purpose registers.
For small blocksizes the 3way-parallel functions from the twofish-x86_64-3way
module are called. A good performance increase is provided for blocksizes
greater or equal to 128B.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

Intel Core i5-2500 CPU (fam:6, model:42, step:7)

twofish-avx-x86_64 vs. twofish-x86_64-3way
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.96x   0.97x   1.00x   0.95x   0.97x   0.97x   0.96x   0.95x   0.95x   0.98x
64B     0.99x   0.99x   1.00x   0.99x   0.98x   0.98x   0.99x   0.98x   0.99x   0.98x
256B    1.20x   1.21x   1.00x   1.19x   1.15x   1.14x   1.19x   1.20x   1.18x   1.19x
1024B   1.29x   1.30x   1.00x   1.28x   1.23x   1.24x   1.26x   1.28x   1.26x   1.27x
8192B   1.31x   1.32x   1.00x   1.31x   1.25x   1.25x   1.28x   1.29x   1.28x   1.30x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.96x   0.96x   1.00x   0.96x   0.97x   0.98x   0.95x   0.95x   0.95x   0.96x
64B     1.00x   0.99x   1.00x   0.98x   0.98x   1.01x   0.98x   0.98x   0.98x   0.98x
256B    1.20x   1.21x   1.00x   1.21x   1.15x   1.15x   1.19x   1.20x   1.18x   1.19x
1024B   1.29x   1.30x   1.00x   1.28x   1.23x   1.23x   1.26x   1.27x   1.26x   1.27x
8192B   1.31x   1.33x   1.00x   1.31x   1.26x   1.26x   1.29x   1.29x   1.28x   1.30x

twofish-avx-x86_64 vs aes-asm (8kB block):
         128bit  256bit
ecb-enc  1.19x   1.63x
ecb-dec  1.18x   1.62x
cbc-enc  0.75x   1.03x
cbc-dec  1.23x   1.67x
ctr-enc  1.24x   1.65x
ctr-dec  1.24x   1.65x
lrw-enc  1.15x   1.53x
lrw-dec  1.14x   1.52x
xts-enc  1.16x   1.56x
xts-dec  1.16x   1.56x

Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-12 16:46:07 +08:00
Mathias Krause 65df577439 crypto: sha1 - use Kbuild supplied flags for AVX test
Commit ea4d26ae ("raid5: add AVX optimized RAID5 checksumming")
introduced x86/ arch wide defines for AFLAGS and CFLAGS indicating AVX
support in binutils based on the same test we have in x86/crypto/ right
now. To minimize duplication drop our implementation in favour to the
one in x86/.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-06-12 16:37:16 +08:00
Mathias Krause 7c8d51848a crypto: aesni-intel - fix unaligned cbc decrypt for x86-32
The 32 bit variant of cbc(aes) decrypt is using instructions requiring
128 bit aligned memory locations but fails to ensure this constraint in
the code. Fix this by loading the data into intermediate registers with
load unaligned instructions.

This fixes reported general protection faults related to aesni.

References: https://bugzilla.kernel.org/show_bug.cgi?id=43223
Reported-by: Daniel <garkein@mailueberfall.de>
Cc: stable@kernel.org [v2.6.39+]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-05-31 20:53:22 +10:00
Jussi Kivilinna ef45b83431 crypto: aesni-intel - move more common code to ablk_init_common
ablk_*_init functions share more common code than what is currently in
ablk_init_common. Move all of the common code to ablk_init_common.

Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-05-15 17:25:33 +10:00
Jussi Kivilinna fa46ccb8eb crypto: aesni-intel - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-05-15 17:25:33 +10:00
Linus Torvalds 6a76a69923 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes a build problem where two crypto modules both try to export
  the same symbols (which shouldn't have been exported in the first
  place)."

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: twofish-x86_64-3way - module init/exit functions should be static
  crypto: camellia-x86_64 - module init/exit functions should be static
2012-03-22 20:19:30 -07:00
Jussi Kivilinna ff0a70fe05 crypto: twofish-x86_64-3way - module init/exit functions should be static
This caused conflict with camellia-x86_64 when compiled into kernel, same
function names and not static.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-03-22 09:17:45 +08:00
Jussi Kivilinna 676a38046f crypto: camellia-x86_64 - module init/exit functions should be static
This caused conflict with twofish-x86_64-3way when compiled into kernel,
same function names and not static.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-03-22 09:17:44 +08:00
Linus Torvalds b8716614a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "* sha512 bug fixes (already in your tree).
  * SHA224/SHA384 AEAD support in caam.
  * X86-64 optimised version of Camellia.
  * Tegra AES support.
  * Bulk algorithm registration interface to make driver registration easier.
  * padata race fixes.
  * Misc fixes."

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (31 commits)
  padata: Fix race on sequence number wrap
  padata: Fix race in the serialization path
  crypto: camellia - add assembler implementation for x86_64
  crypto: camellia - rename camellia.c to camellia_generic.c
  crypto: camellia - fix checkpatch warnings
  crypto: camellia - rename camellia module to camellia_generic
  crypto: tcrypt - add more camellia tests
  crypto: testmgr - add more camellia test vectors
  crypto: camellia - simplify key setup and CAMELLIA_ROUNDSM macro
  crypto: twofish-x86_64/i586 - set alignmask to zero
  crypto: blowfish-x86_64 - set alignmask to zero
  crypto: serpent-sse2 - combine ablk_*_init functions
  crypto: blowfish-x86_64 - use crypto_[un]register_algs
  crypto: twofish-x86_64-3way - use crypto_[un]register_algs
  crypto: serpent-sse2 - use crypto_[un]register_algs
  crypto: serpent-sse2 - remove dead code from serpent_sse2_glue.c::serpent_sse2_init()
  crypto: twofish-x86 - Remove dead code from twofish_glue_3way.c::init()
  crypto: In crypto_add_alg(), 'exact' wants to be initialized to 0
  crypto: caam - fix gcc 4.6 warning
  crypto: Add bulk algorithm registration interface
  ...
2012-03-21 13:20:43 -07:00
Linus Torvalds 9f3938346a Merge branch 'kmap_atomic' of git://github.com/congwang/linux
Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...
2012-03-21 09:40:26 -07:00
Cong Wang 8fd75e1216 x86: remove the second argument of k[un]map_atomic()
Acked-by: Avi Kivity <avi@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:15 +08:00
Jussi Kivilinna 0b95ec56ae crypto: camellia - add assembler implementation for x86_64
Patch adds x86_64 assembler implementation of Camellia block cipher. Two set of
functions are provided. First set is regular 'one-block at time' encrypt/decrypt
functions. Second is 'two-block at time' functions that gain performance increase
on out-of-order CPUs. Performance of 2-way functions should be equal to 1-way
functions with in-order CPUs.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

AMD Phenom II 1055T (fam:16, model:10):

camellia-asm vs camellia_generic:
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     1.27x   1.22x   1.30x   1.42x   1.30x   1.34x   1.19x   1.05x   1.23x   1.24x
64B     1.74x   1.79x   1.43x   1.87x   1.81x   1.87x   1.48x   1.38x   1.55x   1.62x
256B    1.90x   1.87x   1.43x   1.94x   1.94x   1.95x   1.63x   1.62x   1.67x   1.70x
1024B   1.96x   1.93x   1.43x   1.95x   1.98x   2.01x   1.67x   1.69x   1.74x   1.80x
8192B   1.96x   1.96x   1.39x   1.93x   2.01x   2.03x   1.72x   1.64x   1.71x   1.76x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     1.23x   1.23x   1.33x   1.39x   1.34x   1.38x   1.04x   1.18x   1.21x   1.29x
64B     1.72x   1.69x   1.42x   1.78x   1.81x   1.89x   1.57x   1.52x   1.56x   1.65x
256B    1.85x   1.88x   1.42x   1.86x   1.93x   1.96x   1.69x   1.65x   1.70x   1.75x
1024B   1.88x   1.86x   1.45x   1.95x   1.96x   1.95x   1.77x   1.71x   1.77x   1.78x
8192B   1.91x   1.86x   1.42x   1.91x   2.03x   1.98x   1.73x   1.71x   1.78x   1.76x

camellia-asm vs aes-asm (8kB block):
         128bit  256bit
ecb-enc  1.15x   1.22x
ecb-dec  1.16x   1.16x
cbc-enc  0.85x   0.90x
cbc-dec  1.20x   1.23x
ctr-enc  1.28x   1.30x
ctr-dec  1.27x   1.28x
lrw-enc  1.12x   1.16x
lrw-dec  1.08x   1.10x
xts-enc  1.11x   1.15x
xts-dec  1.14x   1.15x

Intel Core2 T8100 (fam:6, model:23, step:6):

camellia-asm vs camellia_generic:
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     1.10x   1.12x   1.14x   1.16x   1.16x   1.15x   1.02x   1.02x   1.08x   1.08x
64B     1.61x   1.60x   1.17x   1.68x   1.67x   1.66x   1.43x   1.42x   1.44x   1.42x
256B    1.65x   1.73x   1.17x   1.77x   1.81x   1.80x   1.54x   1.53x   1.58x   1.54x
1024B   1.76x   1.74x   1.18x   1.80x   1.85x   1.85x   1.60x   1.59x   1.65x   1.60x
8192B   1.77x   1.75x   1.19x   1.81x   1.85x   1.86x   1.63x   1.61x   1.66x   1.62x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     1.10x   1.07x   1.13x   1.16x   1.11x   1.16x   1.03x   1.02x   1.08x   1.07x
64B     1.61x   1.62x   1.15x   1.66x   1.63x   1.68x   1.47x   1.46x   1.47x   1.44x
256B    1.71x   1.70x   1.16x   1.75x   1.69x   1.79x   1.58x   1.57x   1.59x   1.55x
1024B   1.78x   1.72x   1.17x   1.75x   1.80x   1.80x   1.63x   1.62x   1.65x   1.62x
8192B   1.76x   1.73x   1.17x   1.78x   1.80x   1.81x   1.64x   1.62x   1.68x   1.64x

camellia-asm vs aes-asm (8kB block):
         128bit  256bit
ecb-enc  1.17x   1.21x
ecb-dec  1.17x   1.20x
cbc-enc  0.80x   0.82x
cbc-dec  1.22x   1.24x
ctr-enc  1.25x   1.26x
ctr-dec  1.25x   1.26x
lrw-enc  1.14x   1.18x
lrw-dec  1.13x   1.17x
xts-enc  1.14x   1.18x
xts-dec  1.14x   1.17x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-03-14 17:25:56 +08:00
Jussi Kivilinna 8940426489 crypto: twofish-x86_64/i586 - set alignmask to zero
x86 has fast unaligned accesses, so twofish-x86_64/i586 does not need to enforce
alignment.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:24 +08:00
Jussi Kivilinna 919e2c3249 crypto: blowfish-x86_64 - set alignmask to zero
x86 has fast unaligned accesses, so blowfish-x86_64 does not need to enforce
alignment.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:24 +08:00
Jussi Kivilinna 435d3e51af crypto: serpent-sse2 - combine ablk_*_init functions
Driver name in ablk_*_init functions can be constructed runtime. Therefore
use single function ablk_init to reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:23 +08:00
Jussi Kivilinna d433208cfc crypto: blowfish-x86_64 - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:23 +08:00
Jussi Kivilinna 53709ddee3 crypto: twofish-x86_64-3way - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:22 +08:00
Jussi Kivilinna 35474c3bb7 crypto: serpent-sse2 - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:22 +08:00
Jesper Juhl 6e77fe8c11 crypto: serpent-sse2 - remove dead code from serpent_sse2_glue.c::serpent_sse2_init()
We cannot reach the line after 'return err'. Remove it.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-14 16:34:19 +08:00
Jesper Juhl 8d21190e22 crypto: twofish-x86 - Remove dead code from twofish_glue_3way.c::init()
We can never reach the line just after the 'return 0'
statement. Remove it.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-14 16:34:18 +08:00
Andi Kleen 3bd391f056 crypto: Add support for x86 cpuid auto loading for x86 crypto drivers
Add support for auto-loading of crypto drivers based on cpuid features.
This enables auto-loading of the VIA and Intel specific drivers
for AES, hashing and CRCs.

Requires the earlier infrastructure patch to add x86 modinfo.
I kept it all in a single patch for now.

I dropped the printks when the driver cpuid doesn't match (imho
drivers never should print anything in such a case)

One drawback is that udev doesn't know if the drivers are used or not,
so they will be unconditionally loaded at boot up. That's better
than not loading them at all, like it often happens.

Cc: Dave Jones <davej@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jen Axboe <axboe@kernel.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 16:48:10 -08:00
Jussi Kivilinna 847cb7ef56 crypto: serpent-sse2 - change transpose_4x4 to only use integer instructions
Matrix transpose macro in serpent-sse2 uses mix of SSE2 integer and SSE floating
point instructions, which might cause performance penality on some CPUs.

This patch replaces transpose_4x4 macro with version that uses only SSE2
integer instructions.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-01-13 16:38:40 +11:00
Jussi Kivilinna 4c58464b80 crypto: blowfish-x86_64 - blacklist Pentium 4
Implementation in blowfish-x86_64 uses 64bit rotations which are slow on P4,
making blowfish-x86_64 slower than generic C implementation. Therefore
blacklist P4.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-01-13 16:38:39 +11:00
Jussi Kivilinna a522ee85ba crypto: twofish-x86_64-3way - blacklist pentium4 and atom
Performance of twofish-x86_64-3way on Intel Pentium 4 and Atom is lower than
of twofish-x86_64 module. So blacklist these CPUs.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-01-13 16:38:39 +11:00
Jussi Kivilinna 7ba8babf84 crypto: serpent-sse2 - remove unneeded LRW/XTS #ifdefs
Since LRW & XTS are selected by serpent-sse2, we don't need these #ifdefs
anymore.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-12-20 15:20:08 +08:00
Jussi Kivilinna 88715b9ade crypto: twofish-x86_64-3way - remove unneeded LRW/XTS #ifdefs
Since LRW & XTS are selected by twofish-x86_64-3way, we don't need these
#ifdefs anymore.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-12-20 15:20:07 +08:00
Jussi Kivilinna d356433856 crypto: serpent-sse2 - clear CRYPTO_TFM_REQ_MAY_SLEEP in lrw and xts modes
LRW/XTS patches for serpent-sse2 forgot to add this. CRYPTO_TFM_REQ_MAY_SLEEP
should be cleared as sleeping between kernel_fpu_begin()/kernel_fpu_end() is
not allowed.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-21 16:13:25 +08:00
Jussi Kivilinna 5962f8b66d crypto: serpent-sse2 - add xts support
Patch adds XTS support for serpent-sse2 by using xts_crypt(). Patch has been
tested with tcrypt and automated filesystem tests.

Tcrypt benchmarks results (serpent-sse2/serpent_generic speed ratios):

Intel Celeron T1600 (x86_64) (fam:6, model:15, step:13):
size    xts-enc xts-dec
16B     0.98x   1.00x
64B     1.00x   1.01x
256B    2.78x   2.75x
1024B   3.30x   3.26x
8192B   3.39x   3.30x

AMD Phenom II 1055T (x86_64) (fam:16, model:10):
size    xts-enc xts-dec
16B     1.05x   1.02x
64B     1.04x   1.03x
256B    2.10x   2.05x
1024B   2.34x   2.35x
8192B   2.34x   2.40x

Intel Atom N270 (i586):
size    xts-enc xts-dec
16B     0.95x   0.96x
64B     1.53x   1.50x
256B    1.72x   1.75x
1024B   1.88x   1.87x
8192B   1.86x   1.83x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-21 16:13:24 +08:00
Jussi Kivilinna 18482053f9 crypto: serpent-sse2 - add lrw support
Patch adds LRW support for serpent-sse2 by using lrw_crypt(). Patch has been
tested with tcrypt and automated filesystem tests.

Tcrypt benchmarks results (serpent-sse2/serpent_generic speed ratios):

Benchmark results with tcrypt:

Intel Celeron T1600 (x86_64) (fam:6, model:15, step:13):
size    lrw-enc lrw-dec
16B     1.00x   0.96x
64B     1.01x   1.01x
256B    3.01x   2.97x
1024B   3.39x   3.33x
8192B   3.35x   3.33x

AMD Phenom II 1055T (x86_64) (fam:16, model:10):
size    lrw-enc lrw-dec
16B     0.98x   1.03x
64B     1.01x   1.04x
256B    2.10x   2.14x
1024B   2.28x   2.33x
8192B   2.30x   2.33x

Intel Atom N270 (i586):
size    lrw-enc lrw-dec
16B     0.97x   0.97x
64B     1.47x   1.50x
256B    1.72x   1.69x
1024B   1.88x   1.81x
8192B   1.84x   1.79x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-21 16:13:24 +08:00
Jussi Kivilinna 251496dbfc crypto: serpent - add 4-way parallel i586/SSE2 assembler implementation
Patch adds i586/SSE2 assembler implementation of serpent cipher. Assembler
functions crypt data in four block chunks.

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmarks results (serpent-sse2/serpent_generic speed ratios):

Intel Atom N270:

size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16      0.95x   1.12x   1.02x   1.07x   0.97x   0.98x
64      1.73x   1.82x   1.08x   1.82x   1.72x   1.73x
256     2.08x   2.00x   1.04x   2.07x   1.99x   2.01x
1024    2.28x   2.18x   1.05x   2.23x   2.17x   2.20x
8192    2.28x   2.13x   1.05x   2.23x   2.18x   2.20x

Full output:
 http://koti.mbnet.fi/axh/kernel/crypto/atom-n270/serpent-generic.txt
 http://koti.mbnet.fi/axh/kernel/crypto/atom-n270/serpent-sse2.txt

Userspace test results:

Encryption/decryption of sse2-i586 vs generic on Intel Atom N270:
 encrypt: 2.35x
 decrypt: 2.54x

Encryption/decryption of sse2-i586 vs generic on AMD Phenom II:
 encrypt: 1.82x
 decrypt: 2.51x

Encryption/decryption of sse2-i586 vs generic on Intel Xeon E7330:
 encrypt: 2.99x
 decrypt: 3.48x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-21 16:13:23 +08:00
Jussi Kivilinna 937c30d7f5 crypto: serpent - add 8-way parallel x86_64/SSE2 assembler implementation
Patch adds x86_64/SSE2 assembler implementation of serpent cipher. Assembler
functions crypt data in eigth block chunks (two 4 block chunk SSE2 operations
in parallel to improve performance on out-of-order CPUs). Glue code is based
on one from AES-NI implementation, so requests from irq context are redirected
to cryptd.

v2:
 - add missing include of linux/module.h
   (appearently crypto.h used to include module.h, which changed for 3.2 by
    commit 7c926402a7)

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmarks results (serpent-sse2/serpent_generic speed ratios):

AMD Phenom II 1055T (fam:16, model:10):

size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16B     1.03x   1.01x   1.03x   1.05x   1.00x   0.99x
64B     1.00x   1.01x   1.02x   1.04x   1.02x   1.01x
256B    2.34x   2.41x   0.99x   2.43x   2.39x   2.40x
1024B   2.51x   2.57x   1.00x   2.59x   2.56x   2.56x
8192B   2.50x   2.54x   1.00x   2.55x   2.57x   2.57x

Intel Celeron T1600 (fam:6, model:15, step:13):

size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16B     0.97x   0.97x   1.01x   1.01x   1.01x   1.02x
64B     1.00x   1.00x   1.00x   1.02x   1.01x   1.01x
256B    3.41x   3.35x   1.00x   3.39x   3.42x   3.44x
1024B   3.75x   3.72x   0.99x   3.74x   3.75x   3.75x
8192B   3.70x   3.68x   0.99x   3.68x   3.69x   3.69x

Full output:
 http://koti.mbnet.fi/axh/kernel/crypto/phenom-ii-1055t/serpent-generic.txt
 http://koti.mbnet.fi/axh/kernel/crypto/phenom-ii-1055t/serpent-sse2.txt
 http://koti.mbnet.fi/axh/kernel/crypto/celeron-t1600/serpent-generic.txt
 http://koti.mbnet.fi/axh/kernel/crypto/celeron-t1600/serpent-sse2.txt

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-21 16:13:23 +08:00
Jussi Kivilinna bae6d3038b crypto: twofish-x86_64-3way - add xts support
Patch adds XTS support for twofish-x86_64-3way by using xts_crypt(). Patch has
been tested with tcrypt and automated filesystem tests.

Tcrypt benchmarks results (twofish-3way/twofish-asm speed ratios):

Intel Celeron T1600 (fam:6, model:15, step:13):

size    xts-enc xts-dec
16B     0.98x   1.00x
64B     1.14x   1.15x
256B    1.23x   1.25x
1024B   1.26x   1.29x
8192B   1.28x   1.30x

AMD Phenom II 1055T (fam:16, model:10):

size    xts-enc xts-dec
16B     1.03x   1.03x
64B     1.13x   1.16x
256B    1.20x   1.20x
1024B   1.22x   1.22x
8192B   1.22x   1.21x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-09 11:57:57 +08:00
Jussi Kivilinna 81559f9ad3 crypto: twofish-x86_64-3way - add lrw support
Patch adds LRW support for twofish-x86_64-3way by using lrw_crypt(). Patch has
been tested with tcrypt and automated filesystem tests.

Tcrypt benchmarks results (twofish-3way/twofish-asm speed ratios):

Intel Celeron T1600 (fam:6, model:15, step:13):

size	lrw-enc	lrw-dec
16B	0.99x	1.00x
64B	1.17x	1.17x
256B	1.26x	1.27x
1024B	1.30x	1.31x
8192B	1.31x	1.32x

AMD Phenom II 1055T (fam:16, model:10):

size	lrw-enc	lrw-dec
16B	1.06x	1.01x
64B	1.08x	1.14x
256B	1.19x	1.20x
1024B	1.21x	1.22x
8192B	1.23x	1.24x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-11-09 11:53:32 +08:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Paul Gortmaker 7c52d55170 x86: fix up files really needing to include module.h
These files aren't just exporting symbols -- they are also defining
a MODULE_LICENSE etc. so give them the full module.h file.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:36 -04:00
Jussi Kivilinna 906b2c9f2d crypto: twofish-x86_64-3way - fix ctr blocksize to 1
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-10-21 14:28:57 +02:00
Jussi Kivilinna a516ebafdf crypto: blowfish-x86_64 - fix ctr blocksize to 1
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-10-21 14:28:57 +02:00
Jussi Kivilinna 8280daad43 crypto: twofish - add 3-way parallel x86_64 assembler implemention
Patch adds 3-way parallel x86_64 assembly implementation of twofish as new
module. New assembler functions crypt data in three blocks chunks, improving
cipher performance on out-of-order CPUs.

Patch has been tested with tcrypt and automated filesystem tests.

Summary of the tcrypt benchmarks:

Twofish 3-way-asm vs twofish asm (128bit 8kb block ECB)
 encrypt: 1.3x speed
 decrypt: 1.3x speed

Twofish 3-way-asm vs twofish asm (128bit 8kb block CBC)
 encrypt: 1.07x speed
 decrypt: 1.4x speed

Twofish 3-way-asm vs twofish asm (128bit 8kb block CTR)
 encrypt: 1.4x speed

Twofish 3-way-asm vs AES asm (128bit 8kb block ECB)
 encrypt: 1.0x speed
 decrypt: 1.0x speed

Twofish 3-way-asm vs AES asm (128bit 8kb block CBC)
 encrypt: 0.84x speed
 decrypt: 1.09x speed

Twofish 3-way-asm vs AES asm (128bit 8kb block CTR)
 encrypt: 1.15x speed

Full output:
 http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-twofish-3way-asm-x86_64.txt
 http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-twofish-asm-x86_64.txt
 http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-aes-asm-x86_64.txt

Tests were run on:
 vendor_id  : AuthenticAMD
 cpu family : 16
 model      : 10
 model name : AMD Phenom(tm) II X6 1055T Processor

Also userspace test were run on:
 vendor_id  : GenuineIntel
 cpu family : 6
 model      : 15
 model name : Intel(R) Xeon(R) CPU           E7330  @ 2.40GHz
 stepping   : 11

Userspace test results:

Encryption/decryption of twofish 3-way vs x86_64-asm on AMD Phenom II:
 encrypt: 1.27x
 decrypt: 1.25x

Encryption/decryption of twofish 3-way vs x86_64-asm on Intel Xeon E7330:
 encrypt: 1.36x
 decrypt: 1.36x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-10-21 14:23:08 +02:00
Jussi Kivilinna 91d41f159d crypto: twofish-x86-asm - make assembler functions use twofish_ctx instead of crypto_tfm
This needed by 3-way twofish patch to be able to easily use one block
assembler functions. As glue code is shared between i586/x86_64 apply
change to i586 assembler too. Also export assembler functions for
3-way parallel twofish module.

CC: Joachim Fritschi <jfritschi@freenet.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-10-21 14:23:08 +02:00
Jussi Kivilinna a071d06e34 crypto: blowfish-x86_64 - add credits
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-10-21 14:23:07 +02:00
Jussi Kivilinna e827bb09c8 crypto: blowfish-x86_64 - improve x86_64 blowfish 4-way performance
This patch adds improved F-macro for 4-way parallel functions. With new
F-macro for 4-way parallel functions, blowfish sees ~15% improvement in
speed tests on AMD Phenom II (~5% on Intel Xeon E7330).

However when used in 1-way blowfish function new macro would be ~10%
slower than original, so old F-macro is kept for 1-way functions.
Patch cleans up old F-macro as it is no longer needed in 4-way part.

Patch also does register macro renaming to reduce stack usage.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-10-21 14:23:07 +02:00
H Hartley Sweeten 4a4cc2b6bf crypto: aes-x86 - quiet sparse noise about symbol not declared
Include <asm/aes.h> to pick up the declarations for crypto_aes_encrypt_x86
and crypto_aes_decrypt_x86 to quiet the sparse noise:

warning: symbol 'crypto_aes_encrypt_x86' was not declared. Should it be static?
warning: symbol 'crypto_aes_decrypt_x86' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Acked-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-09-22 21:33:01 +10:00
Jussi Kivilinna 64b94ceae8 crypto: blowfish - add x86_64 assembly implementation
Patch adds x86_64 assembly implementation of blowfish. Two set of assembler
functions are provided. First set is regular 'one-block at time'
encrypt/decrypt functions. Second is 'four-block at time' functions that
gain performance increase on out-of-order CPUs. Performance of 4-way
functions should be equal to 1-way functions with in-order CPUs.

Summary of the tcrypt benchmarks:

Blowfish assembler vs blowfish C (256bit 8kb block ECB)
encrypt: 2.2x speed
decrypt: 2.3x speed

Blowfish assembler vs blowfish C (256bit 8kb block CBC)
encrypt: 1.12x speed
decrypt: 2.5x speed

Blowfish assembler vs blowfish C (256bit 8kb block CTR)
encrypt: 2.5x speed

Full output:
http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-blowfish-asm-x86_64.txt
http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-blowfish-c-x86_64.txt

Tests were run on:
 vendor_id	: AuthenticAMD
 cpu family	: 16
 model		: 10
 model name	: AMD Phenom(tm) II X6 1055T Processor
 stepping	: 0

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-09-22 21:25:26 +10:00
Mathias Krause 66be895158 crypto: sha1 - SSSE3 based SHA1 implementation for x86-64
This is an assembler implementation of the SHA1 algorithm using the
Supplemental SSE3 (SSSE3) instructions or, when available, the
Advanced Vector Extensions (AVX).

Testing with the tcrypt module shows the raw hash performance is up to
2.3 times faster than the C implementation, using 8k data blocks on a
Core 2 Duo T5500. For the smalest data set (16 byte) it is still 25%
faster.

Since this implementation uses SSE/YMM registers it cannot safely be
used in every situation, e.g. while an IRQ interrupts a kernel thread.
The implementation falls back to the generic SHA1 variant, if using
the SSE/YMM registers is not possible.

With this algorithm I was able to increase the throughput of a single
IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using
the SSSE3 variant -- a speedup of +34.8%.

Saving and restoring SSE/YMM state might make the actual throughput
fluctuate when there are FPU intensive userland applications running.
For example, meassuring the performance using iperf2 directly on the
machine under test gives wobbling numbers because iperf2 uses the FPU
for each packet to check if the reporting interval has expired (in the
above test I got min/max/avg: 402/484/464 MBit/s).

Using this algorithm on a IPsec gateway gives much more reasonable and
stable numbers, albeit not as high as in the directly connected case.
Here is the result from an RFC 2544 test run with a EXFO Packet Blazer
FTB-8510:

 frame size    sha1-generic     sha1-ssse3    delta
    64 byte     37.5 MBit/s    37.5 MBit/s     0.0%
   128 byte     56.3 MBit/s    62.5 MBit/s   +11.0%
   256 byte     87.5 MBit/s   100.0 MBit/s   +14.3%
   512 byte    131.3 MBit/s   150.0 MBit/s   +14.2%
  1024 byte    162.5 MBit/s   193.8 MBit/s   +19.3%
  1280 byte    175.0 MBit/s   212.5 MBit/s   +21.4%
  1420 byte    175.0 MBit/s   218.7 MBit/s   +25.0%
  1518 byte    150.0 MBit/s   181.2 MBit/s   +20.8%

The throughput for the largest frame size is lower than for the
previous size because the IP packets need to be fragmented in this
case to make there way through the IPsec tunnel.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Maxim Locktyukhin <maxim.locktyukhin@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-08-10 19:00:29 +08:00
Gustavo F. Padovan c3e73e76a9 crypto: ghash-intel - Fix set but not used in ghash_async_setkey()
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-06-30 07:43:42 +08:00
Randy Dunlap 9bed4aca29 crypto: aesni-intel - fix aesni build on i386
Fix build error on i386 by moving function prototypes:

arch/x86/crypto/aesni-intel_glue.c: In function 'aesni_init':
arch/x86/crypto/aesni-intel_glue.c:1263: error: implicit declaration of function 'crypto_fpu_init'
arch/x86/crypto/aesni-intel_glue.c: In function 'aesni_exit':
arch/x86/crypto/aesni-intel_glue.c:1373: error: implicit declaration of function 'crypto_fpu_exit'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-05-18 09:03:34 +10:00
Andy Lutomirski b23b645165 crypto: aesni-intel - Merge with fpu.ko
Loading fpu without aesni-intel does nothing.  Loading aesni-intel
without fpu causes modes like xts to fail.  (Unloading
aesni-intel will restore those modes.)

One solution would be to make aesni-intel depend on fpu, but it
seems cleaner to just combine the modules.

This is probably responsible for bugs like:
https://bugzilla.redhat.com/show_bug.cgi?id=589390

Signed-off-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-05-16 15:12:47 +10:00
Tadeusz Struk 60af520cf2 crypto: aesni-intel - fixed problem with packets that are not multiple of 64bytes
This patch fixes problem with packets that are not multiple of 64bytes.

Signed-off-by: Adrian Hoban <adrian.hoban@intel.com>
Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-03-27 10:29:39 +08:00
Linus Torvalds f2e1fbb5f2 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Flush TLB if PGD entry is changed in i386 PAE mode
  x86, dumpstack: Correct stack dump info when frame pointer is available
  x86: Clean up csum-copy_64.S a bit
  x86: Fix common misspellings
  x86: Fix misspelling and align params
  x86: Use PentiumPro-optimized partial_csum() on VIA C7
2011-03-18 10:45:21 -07:00
Lucas De Marchi 0d2eb44f63 x86: Fix common misspellings
They were generated by 'codespell' and then manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-18 10:39:30 +01:00
Jesper Juhl fc9044e2db crypto: aesni-intel - Fix remaining leak in rfc4106_set_hash_key
Fix up previous patch that failed to properly fix mem leak in 
rfc4106_set_hash_subkey(). This add-on patch; fixes the leak. moves 
kfree() out of the error path, returns -ENOMEM rather than -EINVAL when 
ablkcipher_request_alloc() fails.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-02-16 13:04:09 +11:00
Jesper Juhl 7efd95f627 crypto: aesni-intel - Don't leak memory in rfc4106_set_hash_subkey
There's a small memory leak in 
arch/x86/crypto/aesni-intel_glue.c::rfc4106_set_hash_subkey(). If the call 
to kmalloc() fails and returns NULL then the memory allocated previously 
by ablkcipher_request_alloc() is not freed when we leave the function.

I could have just added a call to ablkcipher_request_free() before we 
return -ENOMEM, but that started to look too much like the code we 
already had at the end of the function, so I chose instead to rework the 
code a bit so that there are now a few labels at the end that we goto when 
various allocations fail, so we don't have to repeat the same blocks of 
code (this also reduces the object code size slightly).

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2011-01-23 18:59:17 +11:00
Linus Torvalds 27d189c02b Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits)
  hwrng: via_rng - Fix memory scribbling on some CPUs
  crypto: padlock - Move padlock.h into include/crypto
  hwrng: via_rng - Fix asm constraints
  crypto: n2 - use __devexit not __exit in n2_unregister_algs
  crypto: mark crypto workqueues CPU_INTENSIVE
  crypto: mv_cesa - dont return PTR_ERR() of wrong pointer
  crypto: ripemd - Set module author and update email address
  crypto: omap-sham - backlog handling fix
  crypto: gf128mul - Remove experimental tag
  crypto: af_alg - fix af_alg memory_allocated data type
  crypto: aesni-intel - Fixed build with binutils 2.16
  crypto: af_alg - Make sure sk_security is initialized on accept()ed sockets
  net: Add missing lockdep class names for af_alg
  include: Install linux/if_alg.h for user-space crypto API
  crypto: omap-aes - checkpatch --file warning fixes
  crypto: omap-aes - initialize aes module once per request
  crypto: omap-aes - unnecessary code removed
  crypto: omap-aes - error handling implementation improved
  crypto: omap-aes - redundant locking is removed
  crypto: omap-aes - DMA initialization fixes for OMAP off mode
  ...
2011-01-13 10:25:58 -08:00
Randy Dunlap 52f6c5ad43 crypto: ghash-intel - ghash-clmulni-intel_glue needs err.h
Add missing header file:

arch/x86/crypto/ghash-clmulni-intel_glue.c:256: error: implicit declaration of function 'IS_ERR'
arch/x86/crypto/ghash-clmulni-intel_glue.c:257: error: implicit declaration of function 'PTR_ERR'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-12-15 19:44:08 +08:00
Tadeusz Struk 3c097b8008 crypto: aesni-intel - Fixed build with binutils 2.16
This patch fixes the problem with 2.16 binutils.

Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com>
Signed-off-by: Adrian Hoban <adrian.hoban@intel.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-12-13 19:51:15 +08:00
Mathias Krause 559ad0ff13 crypto: aesni-intel - Fixed build error on x86-32
Exclude AES-GCM code for x86-32 due to heavy usage of 64-bit registers
not available on x86-32.

While at it, fixed unregister order in aesni_exit().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-11-29 08:35:39 +08:00
Mathias Krause 0d258efb6a crypto: aesni-intel - Ported implementation to x86-32
The AES-NI instructions are also available in legacy mode so the 32-bit
architecture may profit from those, too.

To illustrate the performance gain here's a short summary of a dm-crypt
speed test on a Core i7 M620 running at 2.67GHz comparing both assembler
implementations:

x86:                   i568       aes-ni    delta
ECB, 256 bit:     93.8 MB/s   123.3 MB/s   +31.4%
CBC, 256 bit:     84.8 MB/s   262.3 MB/s  +209.3%
LRW, 256 bit:    108.6 MB/s   222.1 MB/s  +104.5%
XTS, 256 bit:    105.0 MB/s   205.5 MB/s   +95.7%

Additionally, due to some minor optimizations, the 64-bit version also
got a minor performance gain as seen below:

x86-64:           old impl.    new impl.    delta
ECB, 256 bit:    121.1 MB/s   123.0 MB/s    +1.5%
CBC, 256 bit:    285.3 MB/s   290.8 MB/s    +1.9%
LRW, 256 bit:    263.7 MB/s   265.3 MB/s    +0.6%
XTS, 256 bit:    251.1 MB/s   255.3 MB/s    +1.7%

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-11-27 16:34:46 +08:00
Tadeusz Struk 0bd82f5f63 crypto: aesni-intel - RFC4106 AES-GCM Driver Using Intel New Instructions
This patch adds an optimized RFC4106 AES-GCM implementation for 64-bit
kernels. It supports 128-bit AES key size. This leverages the crypto
AEAD interface type to facilitate a combined AES & GCM operation to
be implemented in assembly code. The assembly code leverages Intel(R)
AES New Instructions and the PCLMULQDQ instruction.

Signed-off-by: Adrian Hoban <adrian.hoban@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com>
Signed-off-by: Erdinc Ozturk <erdinc.ozturk@intel.com>
Signed-off-by: James Guilford <james.guilford@intel.com>
Signed-off-by: Wajdi Feghali <wajdi.k.feghali@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-11-13 21:47:55 +09:00
Herbert Xu df2071bd08 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2010-05-03 11:28:58 +08:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Huang Ying 32cbd7dfce crypto: aesni-intel - Fix CTR optimization build failure with gas 2.16.1
Andrew Morton reported that AES-NI CTR optimization failed to compile
with gas 2.16.1, the error message is as follow:

arch/x86/crypto/aesni-intel_asm.S: Assembler messages:
arch/x86/crypto/aesni-intel_asm.S:752: Error: suffix or operands invalid for `movq'
arch/x86/crypto/aesni-intel_asm.S:753: Error: suffix or operands invalid for `movq'

To fix this, a gas macro is defined to assemble movq with 64bit
general purpose registers and XMM registers. The macro will generate
the raw .byte sequence for needed instructions.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-03-13 16:28:42 +08:00
Huang Ying 12387a46bb crypto: aesni-intel - Add AES-NI accelerated CTR mode
To take advantage of the hardware pipeline implementation of AES-NI
instructions. CTR mode cryption is implemented in ASM to schedule
multiple AES-NI instructions one after another. This way, some latency
of AES-NI instruction can be eliminated.

Performance testing based on dm-crypt should 50% reduction of
ecryption/decryption time.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-03-10 18:28:55 +08:00
Daniel Mack 3ad2f3fbb9 tree-wide: Assorted spelling fixes
In particular, several occurances of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-09 11:13:56 +01:00
Herbert Xu 8386324381 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-12-01 15:16:22 +08:00
Jiri Kosina 68ee87164e crypto: ghash-clmulni-intel - Put proper .data section in place
Lbswap_mask, Lpoly and Ltwo_one should clearly belong to
.data section, not .text.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-11-23 20:19:47 +08:00
Huang Ying 564ec0ec05 crypto: ghash-clmulni-intel - Use gas macro for PCLMULQDQ-NI and PSHUFB
Old binutils do not support PCLMULQDQ-NI and PSHUFB, to make kernel
can be compiled by them, .byte code is used instead of assembly
instructions. But the readability and flexibility of raw .byte code is
not good.

So corresponding assembly instruction like gas macro is used instead.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-11-23 19:55:22 +08:00
Huang Ying b369e52123 crypto: aesni-intel - Use gas macro for AES-NI instructions
Old binutils do not support AES-NI instructions, to make kernel can be
compiled by them, .byte code is used instead of AES-NI assembly
instructions. But the readability and flexibility of raw .byte code is
not good.

So corresponding assembly instruction like gas macro is used instead.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-11-23 19:54:06 +08:00
Huang Ying 01dd958277 crypto: ghash-intel - Fix irq_fpu_usable usage
When renaming kernel_fpu_using to irq_fpu_usable, the semantics of the
function is changed too, from mesuring whether kernel is using FPU,
that is, the FPU is NOT available, to measuring whether FPU is usable,
that is, the FPU is available.

But the usage of irq_fpu_usable in ghash-clmulni-intel_glue.c is not
changed accordingly. This patch fixes this.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-11-03 10:55:20 -05:00
Herbert Xu 3b0d65969b crypto: ghash-intel - Add PSHUFB macros
Add PSHUFB macros instead of repeating byte sequences, suggested
by Ingo.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-11-03 09:11:15 -05:00
Herbert Xu 2d06ef7f42 crypto: ghash-intel - Hard-code pshufb
Old gases don't have a clue what pshufb stands for so we have
to hard-code it for now.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-11-01 12:49:44 -05:00
Huang Ying 13b79b9715 crypto: aesni-intel - Fix irq_fpu_usable usage
When renaming kernel_fpu_using to irq_fpu_usable, the semantics of the
function is changed too, from mesuring whether kernel is using FPU,
that is, the FPU is NOT available, to measuring whether FPU is usable,
that is, the FPU is available.

But the usage of irq_fpu_usable in aesni-intel_glue.c is not changed
accordingly. This patch fixes this.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-10-20 16:20:47 +09:00
Huang Ying 0e1227d356 crypto: ghash - Add PCLMULQDQ accelerated implementation
PCLMULQDQ is used to accelerate the most time-consuming part of GHASH,
carry-less multiplication. More information about PCLMULQDQ can be
found at:

http://software.intel.com/en-us/articles/carry-less-multiplication-and-its-usage-for-computing-the-gcm-mode/

Because PCLMULQDQ changes XMM state, its usage must be enclosed with
kernel_fpu_begin/end, which can be used only in process context, the
acceleration is implemented as crypto_ahash. That is, request in soft
IRQ context will be defered to the cryptd kernel thread.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-10-19 11:53:06 +09:00
Linus Torvalds c7208de304 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (22 commits)
  x86: Fix code patching for paravirt-alternatives on 486
  x86, msr: change msr-reg.o to obj-y, and export its symbols
  x86: Use hard_smp_processor_id() to get apic id for AMD K8 cpus
  x86, sched: Workaround broken sched domain creation for AMD Magny-Cours
  x86, mcheck: Use correct cpumask for shared bank4
  x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors
  x86: Fix CPU llc_shared_map information for AMD Magny-Cours
  x86, msr: Fix msr-reg.S compilation with gas 2.16.1, on 32-bit too
  x86: Move kernel_fpu_using to irq_fpu_usable in asm/i387.h
  x86, msr: fix msr-reg.S compilation with gas 2.16.1
  x86, msr: Export the register-setting MSR functions via /dev/*/msr
  x86, msr: Create _on_cpu helpers for {rw,wr}msr_safe_regs()
  x86, msr: Have the _safe MSR functions return -EIO, not -EFAULT
  x86, msr: CFI annotations, cleanups for msr-reg.S
  x86, asm: Make _ASM_EXTABLE() usable from assembly code
  x86, asm: Add 32-bit versions of the combined CFI macros
  x86, AMD: Disable wrongly set X86_FEATURE_LAHF_LM CPUID bit
  x86, msr: Rewrite AMD rd/wrmsr variants
  x86, msr: Add rd/wrmsr interfaces with preset registers
  x86: add specific support for Intel Atom architecture
  ...
2009-09-14 07:57:32 -07:00
Huang Ying ae4b688db2 x86: Move kernel_fpu_using to irq_fpu_usable in asm/i387.h
This function measures whether the FPU/SSE state can be touched in
interrupt context. If the interrupted code is in user space or has no
valid FPU/SSE context (CR0.TS == 1), FPU/SSE state can be used in IRQ
or soft_irq context too.

This is used by AES-NI accelerated AES implementation and PCLMULQDQ
accelerated GHASH implementation.

v3:
 - Renamed to irq_fpu_usable to reflect the purpose of the function.

v2:
 - Renamed to irq_is_fpu_using to reflect the real situation.

Signed-off-by: Huang Ying <ying.huang@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-09-01 21:39:15 -07:00
Roland Dreier c9944881ac crypto: aes-ni - Don't print message with KERN_ERR on old system
When the aes-intel module is loaded on a system that does not have the
AES instructions, it prints

    Intel AES-NI instructions are not detected.

at level KERN_ERR.  Since aes-intel is aliased to "aes" it will be tried
whenever anything uses AES and spam the console.  This doesn't match
existing practice for how to handle "no hardware" when initializing a
module, so downgrade the message to KERN_INFO.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-06-24 13:42:40 +08:00
Huang Ying b6f34d44cb crypto: aes-ni - Remove CRYPTO_TFM_REQ_MAY_SLEEP from fpu template
kernel_fpu_begin/end used preempt_disable/enable, so sleep should be
prevented between kernel_fpu_begin/end.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-06-18 19:44:01 +08:00
Huang Ying 9251b64fb2 crypto: aes-ni - Do not sleep when using the FPU
Because AES-NI instructions will touch XMM state, corresponding code
must be enclosed within kernel_fpu_begin/end, which used
preempt_disable/enable. So sleep should be prevented between
kernel_fpu_begin/end.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-06-18 19:41:27 +08:00
Huang Ying e6efaa0253 crypto: aes-ni - Fix cbc mode IV saving
Original implementation of aesni_cbc_dec do not save IV if input
length % 4 == 0. This will make decryption of next block failed.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-06-18 19:33:57 +08:00