mirror of
https://github.com/fumiama/blake2b-simd.git
synced 2026-06-05 02:00:26 +08:00
Add support for SSE (#10)
This commit is contained in:
106
README.md
106
README.md
@@ -6,23 +6,61 @@ Pure Go implementation of BLAKE2b using SIMD optimizations.
|
||||
Introduction
|
||||
------------
|
||||
|
||||
This package is based on the pure go [BLAKE2b](https://github.com/dchest/blake2b) implementation of Dmitry Chestnykh and merges it with the (`cgo` dependent) SSE optimized [BLAKE2](https://github.com/codahale/blake2) implementation (which in turn is based on [official implementation](https://github.com/BLAKE2/BLAKE2). It does so by using [Go's Assembler](https://golang.org/doc/asm) for amd64 architectures with a fallback for other architectures.
|
||||
This package was initially based on the pure go [BLAKE2b](https://github.com/dchest/blake2b) implementation of Dmitry Chestnykh and merged with the (`cgo` dependent) AVX optimized [BLAKE2](https://github.com/codahale/blake2) implementation (which in turn is based on the [official implementation](https://github.com/BLAKE2/BLAKE2). It does so by using [Go's Assembler](https://golang.org/doc/asm) for amd64 architectures with a golang only fallback for other architectures.
|
||||
|
||||
It gives roughly a 3x performance improvement over the non-optimized go version.
|
||||
In addition to AVX there is also support for AVX2 as well as SSE. Best performance is obtained with AVX2 which gives roughly a **4X** performance increase approaching hashing speeds of **1GB/sec** on a single core.
|
||||
|
||||
Benchmarks
|
||||
----------
|
||||
|
||||
| Dura | 1 GB |
|
||||
| ------------- |:-----:|
|
||||
| blake2b-SIMD | 1.59s |
|
||||
| blake2b | 4.66s |
|
||||
This is a summary of the performance improvements. Full details are shown below.
|
||||
|
||||
| Technology | 128K |
|
||||
| ---------- |:-----:|
|
||||
| AVX2 | 3.94x |
|
||||
| AVX | 3.28x |
|
||||
| SSE | 2.85x |
|
||||
|
||||
asm2plan9s
|
||||
----------
|
||||
|
||||
In order to be able to work more easily with AVX2/AVX instructions, a separate tool was developed to convert AVX2/AVX instructions into the corresponding BYTE sequence as accepted by Go assembly. See [asm2plan9s](https://github.com/fwessels/asm2plan9s) for more information.
|
||||
|
||||
bt2sum
|
||||
------
|
||||
|
||||
[bt2sum](https://github.com/s3git/bt2sum) is a utility that takes advantages of the BLAKE2b SIMD optimizations to compute check sums using the BLAKE2 Tree hashing mode in so called 'unlimited fanout' mode.
|
||||
|
||||
Technical details
|
||||
-----------------
|
||||
|
||||
BLAKE2b is a hashing algorithm that operates on 64-bit integer values. The AVX2 version uses the 256-bit wide YMM registers in order to essentially process four operations in parallel. AVX and SSE operate on 128-bit values simultaneously (two operations in parallel). Below are excerpts from `compressAvx2_amd64.s`, `compressAvx_amd64.s`, and `compress_generic.go` respectively.
|
||||
|
||||
```
|
||||
VPADDQ YMM0,YMM0,YMM1 /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */
|
||||
```
|
||||
|
||||
```
|
||||
VPADDQ XMM0,XMM0,XMM2 /* v0 += v4, v1 += v5 */
|
||||
VPADDQ XMM1,XMM1,XMM3 /* v2 += v6, v3 += v7 */
|
||||
```
|
||||
|
||||
```
|
||||
v0 += v4
|
||||
v1 += v5
|
||||
v2 += v6
|
||||
v3 += v7
|
||||
```
|
||||
|
||||
Detailed benchmarks
|
||||
-------------------
|
||||
|
||||
Example performance metrics were generated on Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz - 6 physical cores, 12 logical cores running Ubuntu GNU/Linux with kernel version 4.4.0-24-generic (vanilla with no optimizations).
|
||||
|
||||
### AVX2
|
||||
|
||||
```
|
||||
$ benchcmp old.txt new.txt
|
||||
$ benchcmp go.txt avx2.txt
|
||||
benchmark old ns/op new ns/op delta
|
||||
BenchmarkHash64-12 1481 849 -42.67%
|
||||
BenchmarkHash128-12 1428 746 -47.76%
|
||||
@@ -40,4 +78,56 @@ BenchmarkHash32K-12 232.87 911.85 3.92x
|
||||
BenchmarkHash128K-12 233.37 918.93 3.94x
|
||||
```
|
||||
|
||||
We can see `2-3x` improvement in performance over native Go under varying block sizes.
|
||||
Benchmarks below were generated on a MacBook Pro with a 2.7 GHz Intel Core i7.
|
||||
|
||||
### AVX
|
||||
|
||||
```
|
||||
$ benchcmp go.txt avx.txt
|
||||
benchmark old ns/op new ns/op delta
|
||||
BenchmarkHash64-8 813 458 -43.67%
|
||||
BenchmarkHash128-8 766 401 -47.65%
|
||||
BenchmarkHash1K-8 4881 1763 -63.88%
|
||||
BenchmarkHash8K-8 36127 12273 -66.03%
|
||||
BenchmarkHash32K-8 140582 43155 -69.30%
|
||||
BenchmarkHash128K-8 567850 173246 -69.49%
|
||||
|
||||
benchmark old MB/s new MB/s speedup
|
||||
BenchmarkHash64-8 78.63 139.57 1.78x
|
||||
BenchmarkHash128-8 166.98 318.73 1.91x
|
||||
BenchmarkHash1K-8 209.76 580.68 2.77x
|
||||
BenchmarkHash8K-8 226.76 667.46 2.94x
|
||||
BenchmarkHash32K-8 233.09 759.29 3.26x
|
||||
BenchmarkHash128K-8 230.82 756.56 3.28x
|
||||
```
|
||||
|
||||
### SSE
|
||||
|
||||
```
|
||||
$ benchcmp go.txt sse.txt
|
||||
benchmark old ns/op new ns/op delta
|
||||
BenchmarkHash64-8 813 478 -41.21%
|
||||
BenchmarkHash128-8 766 411 -46.34%
|
||||
BenchmarkHash1K-8 4881 1870 -61.69%
|
||||
BenchmarkHash8K-8 36127 12427 -65.60%
|
||||
BenchmarkHash32K-8 140582 49512 -64.78%
|
||||
BenchmarkHash128K-8 567850 199040 -64.95%
|
||||
|
||||
benchmark old MB/s new MB/s speedup
|
||||
BenchmarkHash64-8 78.63 133.78 1.70x
|
||||
BenchmarkHash128-8 166.98 311.23 1.86x
|
||||
BenchmarkHash1K-8 209.76 547.37 2.61x
|
||||
BenchmarkHash8K-8 226.76 659.20 2.91x
|
||||
BenchmarkHash32K-8 233.09 661.81 2.84x
|
||||
BenchmarkHash128K-8 230.82 658.52 2.85x
|
||||
```
|
||||
|
||||
License
|
||||
-------
|
||||
|
||||
Released under the Apache License v2.0. You can find the complete text in the file LICENSE.
|
||||
|
||||
Contributing
|
||||
------------
|
||||
|
||||
Contributions are welcome, please send PRs for any enhancements.
|
||||
40
compressSse_amd64.go
Normal file
40
compressSse_amd64.go
Normal file
@@ -0,0 +1,40 @@
|
||||
//+build !noasm
|
||||
//+build !appengine
|
||||
|
||||
/*
|
||||
* Minio Cloud Storage, (C) 2016 Minio, Inc.
|
||||
*
|
||||
* Licensed under the Apache License, Version 2.0 (the "License");
|
||||
* you may not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing, software
|
||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
package blake2b
|
||||
|
||||
//go:noescape
|
||||
func blockSSELoop(p []uint8, in, iv, t, f, shffle, out []uint64)
|
||||
|
||||
func compressSSE(d *digest, p []uint8) {
|
||||
|
||||
in := make([]uint64, 8, 8)
|
||||
out := make([]uint64, 8, 8)
|
||||
|
||||
shffle := make([]uint64, 2, 2)
|
||||
// vector for PSHUFB instruction
|
||||
shffle[0] = 0x0201000706050403
|
||||
shffle[1] = 0x0a09080f0e0d0c0b
|
||||
|
||||
in[0], in[1], in[2], in[3], in[4], in[5], in[6], in[7] = d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7]
|
||||
|
||||
blockSSELoop(p, in, iv[:], d.t[:], d.f[:], shffle, out)
|
||||
|
||||
d.h[0], d.h[1], d.h[2], d.h[3], d.h[4], d.h[5], d.h[6], d.h[7] = out[0], out[1], out[2], out[3], out[4], out[5], out[6], out[7]
|
||||
}
|
||||
882
compressSse_amd64.s
Normal file
882
compressSse_amd64.s
Normal file
@@ -0,0 +1,882 @@
|
||||
//+build !noasm !appengine
|
||||
|
||||
//
|
||||
// Minio Cloud Storage, (C) 2016 Minio, Inc.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
//
|
||||
|
||||
//
|
||||
// Based on SSE implementation from https://github.com/BLAKE2/BLAKE2/blob/master/sse/blake2b.c
|
||||
//
|
||||
// Use github.com/fwessels/asm2plan9s on this file to assemble instructions to their Plan9 equivalent
|
||||
//
|
||||
// Assembly code below essentially follows the ROUND macro (see blake2b-round.h) which is defined as:
|
||||
// #define ROUND(r) \
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1); \
|
||||
// G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1); \
|
||||
// G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
||||
// DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h); \
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1); \
|
||||
// G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1); \
|
||||
// G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
||||
// UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
|
||||
//
|
||||
// as well as the go equivalent in https://github.com/dchest/blake2b/blob/master/block.go
|
||||
//
|
||||
// As in the macro, G1/G2 in the 1st and 2nd half are identical (so literal copy of assembly)
|
||||
//
|
||||
// Rounds are also the same, except for the loading of the message (and rounds 1 & 11 and
|
||||
// rounds 2 & 12 are identical)
|
||||
//
|
||||
|
||||
#define G1 \
|
||||
\ // G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1);
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xc0 \ // PADDQ XMM0,XMM8 /* v0 += m[0], v1 += m[2] */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xc9 \ // PADDQ XMM1,XMM9 /* v2 += m[4], v3 += m[6] */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xc2 \ // PADDQ XMM0,XMM2 /* v0 += v4, v1 += v5 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xcb \ // PADDQ XMM1,XMM3 /* v2 += v6, v3 += v7 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf0 \ // PXOR XMM6,XMM0 /* v12 ^= v0, v13 ^= v1 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf9 \ // PXOR XMM7,XMM1 /* v14 ^= v2, v15 ^= v3 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0x70; BYTE $0xf6; BYTE $0xb1 \ // PSHUFD XMM6,XMM6,0xb1 /* v12 = v12<<(64-32) | v12>>32, v13 = v13<<(64-32) | v13>>32 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0x70; BYTE $0xff; BYTE $0xb1 \ // PSHUFD XMM7,XMM7,0xb1 /* v14 = v14<<(64-32) | v14>>32, v15 = v15<<(64-32) | v15>>32 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xe6 \ // PADDQ XMM4,XMM6 /* v8 += v12, v9 += v13 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xef \ // PADDQ XMM5,XMM7 /* v10 += v14, v11 += v15 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xd4 \ // PXOR XMM2,XMM4 /* v4 ^= v8, v5 ^= v9 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xdd \ // PXOR XMM3,XMM5 /* v6 ^= v10, v7 ^= v11 */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x38; BYTE $0x00 \ // PSHUFB XMM2,XMM12 /* v4 = v4<<(64-24) | v4>>24, v5 = v5<<(64-24) | v5>>24 */
|
||||
BYTE $0xd4 \
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x38; BYTE $0x00 \ // PSHUFB XMM3,XMM12 /* v6 = v6<<(64-24) | v6>>24, v7 = v7<<(64-24) | v7>>24 */
|
||||
BYTE $0xdc \
|
||||
// DO NOT DELETE -- macro delimiter (previous line extended)
|
||||
|
||||
#define G2 \
|
||||
\ // G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1);
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xc2 \ // PADDQ XMM0,XMM10 /* v0 += m[1], v1 += m[3] */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xd4; BYTE $0xcb \ // PADDQ XMM1,XMM11 /* v2 += m[5], v3 += m[7] */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xc2 \ // PADDQ XMM0,XMM2 /* v0 += v4, v1 += v5 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xcb \ // PADDQ XMM1,XMM3 /* v2 += v6, v3 += v7 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf0 \ // PXOR XMM6,XMM0 /* v12 ^= v0, v13 ^= v1 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xf9 \ // PXOR XMM7,XMM1 /* v14 ^= v2, v15 ^= v3 */
|
||||
BYTE $0xf2; BYTE $0x0f; BYTE $0x70; BYTE $0xf6; BYTE $0x39 \ // PSHUFLW XMM6,XMM6,0x39 /* combined with next ... */
|
||||
BYTE $0xf3; BYTE $0x0f; BYTE $0x70; BYTE $0xf6; BYTE $0x39 \ // PSHUFHW XMM6,XMM6,0x39 /* v12 = v12<<(64-16) | v12>>16, v13 = v13<<(64-16) | v13>>16 */
|
||||
BYTE $0xf2; BYTE $0x0f; BYTE $0x70; BYTE $0xff; BYTE $0x39 \ // PSHUFLW XMM7,XMM7,0x39 /* combined with next ... */
|
||||
BYTE $0xf3; BYTE $0x0f; BYTE $0x70; BYTE $0xff; BYTE $0x39 \ // PSHUFHW XMM7,XMM7,0x39 /* v14 = v14<<(64-16) | v14>>16, v15 = v15<<(64-16) | v15>>16 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xe6 \ // PADDQ XMM4,XMM6 /* v8 += v12, v9 += v13 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xd4; BYTE $0xef \ // PADDQ XMM5,XMM7 /* v10 += v14, v11 += v15 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xd4 \ // PXOR XMM2,XMM4 /* v4 ^= v8, v5 ^= v9 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0xef; BYTE $0xdd \ // PXOR XMM3,XMM5 /* v6 ^= v10, v7 ^= v11 */
|
||||
MOVOU X2, X15 \
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0xd4; BYTE $0xfa \ // PADDQ XMM15,XMM2 /* temp reg = reg*2 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0x73; BYTE $0xd2; BYTE $0x3f \ // PSRLQ XMM2,0x3f /* reg = reg>>63 */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xef; BYTE $0xd7 \ // PXOR XMM2,XMM15 /* ORed together: v4 = v4<<(64-63) | v4>>63, v5 = v5<<(64-63) | v5>>63 */
|
||||
MOVOU X3, X15 \
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0xd4; BYTE $0xfb \ // PADDQ XMM15,XMM3 /* temp reg = reg*2 */
|
||||
BYTE $0x66; BYTE $0x0f; BYTE $0x73; BYTE $0xd3; BYTE $0x3f \ // PSRLQ XMM3,0x3f /* reg = reg>>63 */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0xef; BYTE $0xdf // PXOR XMM3,XMM15 /* ORed together: v6 = v6<<(64-63) | v6>>63, v7 = v7<<(64-63) | v7>>63 */
|
||||
|
||||
#define DIAGONALIZE \
|
||||
\ // DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
|
||||
MOVOU X6, X13 \ /* t0 = row4l;\ */
|
||||
MOVOU X2, X14 \ /* t1 = row2l;\ */
|
||||
MOVOU X4, X6 \ /* row4l = row3l;\ */
|
||||
MOVOU X5, X4 \ /* row3l = row3h;\ */
|
||||
MOVOU X6, X5 \ /* row3h = row4l;\ */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xfd \ // PUNPCKLQDQ XMM15, XMM13 /* _mm_unpacklo_epi64(t0, t0) */
|
||||
MOVOU X7, X6 \
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xf7 \ // PUNPCKHQDQ XMM6, XMM15 /* row4l = _mm_unpackhi_epi64(row4h, ); \ */
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xff \ // PUNPCKLQDQ XMM15, XMM7 /* _mm_unpacklo_epi64(row4h, row4h) */
|
||||
MOVOU X13, X7 \
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xff \ // PUNPCKHQDQ XMM7, XMM15 /* row4h = _mm_unpackhi_epi64(t0, ); \ */
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xfb \ // PUNPCKLQDQ XMM15, XMM3 /* _mm_unpacklo_epi64(row2h, row2h) */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7 \ // PUNPCKHQDQ XMM2, XMM15 /* row2l = _mm_unpackhi_epi64(row2l, ); \ */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xfe \ // PUNPCKLQDQ XMM15, XMM14 /* _mm_unpacklo_epi64(t1, t1) */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf // PUNPCKHQDQ XMM3, XMM15 /* row2h = _mm_unpackhi_epi64(row2h, ) */
|
||||
|
||||
#define UNDIAGONALIZE \
|
||||
\ // UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
|
||||
MOVOU X4, X13 \ /* t0 = row3l;\ */
|
||||
MOVOU X5, X4 \ /* row3l = row3h;\ */
|
||||
MOVOU X13, X5 \ /* row3h = t0;\ */
|
||||
MOVOU X2, X13 \ /* t0 = row2l;\ */
|
||||
MOVOU X6, X14 \ /* t1 = row4l;\ */
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xfa \ // PUNPCKLQDQ XMM15, XMM2 /* _mm_unpacklo_epi64(row2l, row2l) */
|
||||
MOVOU X3, X2 \
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7 \ // PUNPCKHQDQ XMM2, XMM15 /* row2l = _mm_unpackhi_epi64(row2h, ); \ */
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xfb \ // PUNPCKLQDQ XMM15, XMM3 /* _mm_unpacklo_epi64(row2h, row2h) */
|
||||
MOVOU X13, X3 \
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf \ // PUNPCKHQDQ XMM3, XMM15 /* row2h = _mm_unpackhi_epi64(t0, ); \ */
|
||||
BYTE $0x66; BYTE $0x44; BYTE $0x0f; BYTE $0x6c; BYTE $0xff \ // PUNPCKLQDQ XMM15, XMM7 /* _mm_unpacklo_epi64(row4h, row4h) */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xf7 \ // PUNPCKHQDQ XMM6, XMM15 /* row4l = _mm_unpackhi_epi64(row4l, ); \ */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xfe \ // PUNPCKLQDQ XMM15, XMM14 /* _mm_unpacklo_epi64(t1, t1) */
|
||||
BYTE $0x66; BYTE $0x41; BYTE $0x0f; BYTE $0x6d; BYTE $0xff // PUNPCKHQDQ XMM7, XMM15 /* row4h = _mm_unpackhi_epi64(row4h, ) */
|
||||
|
||||
#define LOAD_SHUFFLE \
|
||||
\ // Load shuffle value
|
||||
MOVQ shffle+120(FP), SI \ // SI: &shuffle
|
||||
MOVOU 0(SI), X12 // X12 = 03040506 07000102 0b0c0d0e 0f08090a
|
||||
|
||||
// func blockSSELoop(p []uint8, in, iv, t, f, shffle, out []uint64)
|
||||
TEXT ·blockSSELoop(SB), 7, $0
|
||||
// REGISTER USE
|
||||
// R8: loop counter
|
||||
// DX: message pointer
|
||||
// SI: temp pointer for loading
|
||||
// X0 - X7: v0 - v15
|
||||
// X8 - X11: m[0] - m[7]
|
||||
// X12: shuffle value
|
||||
// X13 - X15: temp registers
|
||||
|
||||
// Load digest
|
||||
MOVQ in+24(FP), SI // SI: &in
|
||||
MOVOU 0(SI), X0 // X0 = in[0]+in[1] /* row1l = LOAD( &S->h[0] ); */
|
||||
MOVOU 16(SI), X1 // X1 = in[2]+in[3] /* row1h = LOAD( &S->h[2] ); */
|
||||
MOVOU 32(SI), X2 // X2 = in[4]+in[5] /* row2l = LOAD( &S->h[4] ); */
|
||||
MOVOU 48(SI), X3 // X3 = in[6]+in[7] /* row2h = LOAD( &S->h[6] ); */
|
||||
|
||||
// Already store digest into &out (so we can reload it later generically)
|
||||
MOVQ out+144(FP), SI // SI: &out
|
||||
MOVOU X0, 0(SI) // out[0]+out[1] = X0
|
||||
MOVOU X1, 16(SI) // out[2]+out[3] = X1
|
||||
MOVOU X2, 32(SI) // out[4]+out[5] = X2
|
||||
MOVOU X3, 48(SI) // out[6]+out[7] = X3
|
||||
|
||||
// Initialize message pointer and loop counter
|
||||
MOVQ message+0(FP), DX // DX: &p (message)
|
||||
MOVQ message_len+8(FP), R8 // R8: len(message)
|
||||
SHRQ $7, R8 // len(message) / 128
|
||||
CMPQ R8, $0
|
||||
JEQ complete
|
||||
|
||||
loop:
|
||||
// Increment counter
|
||||
MOVQ t+72(FP), SI // SI: &t
|
||||
MOVQ 0(SI), R9 //
|
||||
ADDQ $128, R9 // /* d.t[0] += BlockSize */
|
||||
MOVQ R9, 0(SI) //
|
||||
CMPQ R9, $128 // /* if d.t[0] < BlockSize { */
|
||||
JGE noincr //
|
||||
MOVQ 8(SI), R9 //
|
||||
ADDQ $1, R9 // /* d.t[1]++ */
|
||||
MOVQ R9, 8(SI) //
|
||||
noincr: // /* } */
|
||||
|
||||
// Load initialization vector
|
||||
MOVQ iv+48(FP), SI // SI: &iv
|
||||
MOVOU 0(SI), X4 // X4 = iv[0]+iv[1] /* row3l = LOAD( &blake2b_IV[0] ); */
|
||||
MOVOU 16(SI), X5 // X5 = iv[2]+iv[3] /* row3h = LOAD( &blake2b_IV[2] ); */
|
||||
MOVOU 32(SI), X6 // X6 = iv[4]+iv[5] /* LOAD( &blake2b_IV[4] ) */
|
||||
MOVOU 48(SI), X7 // X7 = iv[6]+iv[7] /* LOAD( &blake2b_IV[6] ) */
|
||||
MOVQ t+72(FP), SI // SI: &t
|
||||
MOVOU 0(SI), X8 // X8 = t[0]+t[1] /* LOAD( &S->t[0] ) */
|
||||
PXOR X8, X6 // X6 = X6 ^ X8 /* row4l = _mm_xor_si128( , ); */
|
||||
MOVQ t+96(FP), SI // SI: &f
|
||||
MOVOU 0(SI), X8 // X8 = f[0]+f[1] /* LOAD( &S->f[0] ) */
|
||||
PXOR X8, X7 // X7 = X7 ^ X8 /* row4h = _mm_xor_si128( , ); */
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 1
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+m[1]
|
||||
MOVOU 16(DX), X13 // X13 = m[2]+m[3]
|
||||
MOVOU 32(DX), X14 // X14 = m[4]+m[5]
|
||||
MOVOU 48(DX), X15 // X15 = m[6]+m[7]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[0], m[2] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf // PUNPCKLQDQ XMM9, XMM15 /* m[4], m[6] */
|
||||
MOVOU X12, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5 // PUNPCKHQDQ XMM10, XMM13 /* m[1], m[3] */
|
||||
MOVOU X14, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf // PUNPCKHQDQ XMM11, XMM15 /* m[5], m[7] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 64(DX), X12 // X12 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X13 // X13 = m[10]+m[11]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[8],m[10] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf // PUNPCKLQDQ XMM9, XMM15 /* m[12],m[14] */
|
||||
MOVOU X12, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5 // PUNPCKHQDQ XMM10, XMM13 /* m[9],m[11] */
|
||||
MOVOU X14, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf // PUNPCKHQDQ XMM11, XMM15 /* m[13],m[15] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 2
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 112(DX), X12 // X12 = m[14]+m[15]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 96(DX), X15 // X15 = m[12]+m[13]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[14], m[4] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf // PUNPCKHQDQ XMM9, XMM15 /* m[9], m[13] */
|
||||
MOVOU 80(DX), X10 // X10 = m[10]+m[11]
|
||||
MOVOU 48(DX), X11 // X11 = m[6]+ m[7]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6 // PUNPCKLQDQ XMM10, XMM14 /* m[10], m[8] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM11, XMM12, 0x8 /* m[15], m[6] */; ; ; ; ;
|
||||
BYTE $0xdc; BYTE $0x08
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM8, XMM12, 0x8 /* m[1], m[0] */
|
||||
BYTE $0xc4; BYTE $0x08
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcd // PUNPCKHQDQ XMM9, XMM13 /* m[11], m[5] */
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X11 // X11 = m[6]+ m[7]
|
||||
MOVOU 96(DX), X10 // X10 = m[12]+m[13]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4 // PUNPCKLQDQ XMM10, XMM12 /* m[12], m[2] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdc // PUNPCKHQDQ XMM11, XMM12 /* m[7], m[3] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 3
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 32(DX), X12 // X12 = m[4]+ m[5]
|
||||
MOVOU 80(DX), X13 // X13 = m[10]+m[11]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X14, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM8, XMM13, 0x8 /* m[11], m[12] */
|
||||
BYTE $0xc5; BYTE $0x08
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf // PUNPCKHQDQ XMM9, XMM15 /* m[5], m[15] */
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 16(DX), X13 // X13 = m[2]+ m[3]
|
||||
MOVOU 64(DX), X10 // X10 = m[8]+ m[9]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4 // PUNPCKLQDQ XMM10, XMM12 /* m[8], m[0] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6 // PUNPCKHQDQ XMM14, XMM14 /* ___, m[13] */
|
||||
MOVOU X13, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xde // PUNPCKLQDQ XMM11, XMM14 /* m[2], ___ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X15 // X15 = m[10]+m[11]
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc // PUNPCKHQDQ XMM9, XMM12 /* ___, m[3] */
|
||||
MOVOU X15, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1 // PUNPCKLQDQ XMM8, XMM9 /* m[10], ___ */
|
||||
MOVOU X13, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce // PUNPCKHQDQ XMM9, XMM14 /* m[7], m[9] */
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 32(DX), X11 // X11 = m[4]+ m[5]
|
||||
MOVOU 112(DX), X10 // X10 = m[14]+m[15]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd5 // PUNPCKLQDQ XMM10, XMM13 /* m[14], m[6] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM11, XMM12, 0x8 /* m[1], m[4] */
|
||||
BYTE $0xdc; BYTE $0x08
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 4
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU 96(DX), X15 // X15 = m[12]+m[13]
|
||||
MOVOU X13, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc4 // PUNPCKHQDQ XMM8, XMM12 /* m[7], m[3] */
|
||||
MOVOU X15, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce // PUNPCKHQDQ XMM9, XMM14 /* m[13], m[11] */
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 64(DX), X10 // X10 = m[8]+ m[9]
|
||||
MOVOU 112(DX), X14 // X14 = m[14]+m[15]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd4 // PUNPCKHQDQ XMM10, XMM12 /* m[9], m[1] */
|
||||
MOVOU X15, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xde // PUNPCKLQDQ XMM11, XMM14 /* m[12], m[14] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X13, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcd // PUNPCKHQDQ XMM9, XMM13 /* ___, m[5] */
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1 // PUNPCKLQDQ XMM8, XMM9 /* m[2], ____ */
|
||||
MOVOU X15, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7 // PUNPCKHQDQ XMM10, XMM15 /* ___, m[15] */
|
||||
MOVOU X13, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xca // PUNPCKLQDQ XMM9, XMM10 /* m[4], ____ */
|
||||
MOVOU 0(DX), X11 // X11 = m[0]+ m[1]
|
||||
MOVOU 48(DX), X10 // X10 = m[6]+ m[7]
|
||||
MOVOU 64(DX), X15 // X15 = m[8]+ m[9]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6 // PUNPCKLQDQ XMM10, XMM14 /* m[6], m[10] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf // PUNPCKLQDQ XMM11, XMM15 /* m[0], m[8] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 5
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X15 // X15 = m[10]+m[11]
|
||||
MOVOU X14, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc5 // PUNPCKHQDQ XMM8, XMM13 /* m[9], m[5] */
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf // PUNPCKLQDQ XMM9, XMM15 /* m[2], m[10] */
|
||||
MOVOU 0(DX), X10 // X10 = m[0]+ m[1]
|
||||
MOVOU 48(DX), X14 // X14 = m[6]+ m[7]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6 // PUNPCKHQDQ XMM14, XMM14 /* ___, m[7] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6 // PUNPCKLQDQ XMM10, XMM14 /* m[0], ____ */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xff // PUNPCKHQDQ XMM15, XMM15 /* ___, m[15] */
|
||||
MOVOU X13, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf // PUNPCKLQDQ XMM11, XMM15 /* m[4], ____ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6 // PUNPCKHQDQ XMM14, XMM14 /* ___, m[11] */
|
||||
MOVOU X15, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc6 // PUNPCKLQDQ XMM8, XMM14 /* m[14], ____ */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xe4 // PUNPCKHQDQ XMM12, XMM12 /* ___, m[3] */
|
||||
MOVOU X13, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcc // PUNPCKLQDQ XMM9, XMM12 /* m[6], ____ */
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 64(DX), X11 // X11 = m[8]+ m[9]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU X14, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM10, XMM12, 0x8 /* m[1], m[12] */
|
||||
BYTE $0xd4; BYTE $0x08
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xf6 // PUNPCKHQDQ XMM14, XMM14 /* ___, m[13] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xde // PUNPCKLQDQ XMM11, XMM14 /* m[8], ____ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 6
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 16(DX), X13 // X13 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X14 // X14 = m[6]+ m[7]
|
||||
MOVOU 64(DX), X15 // X15 = m[8]+ m[9]
|
||||
MOVOU X13, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc6 // PUNPCKLQDQ XMM8, XMM14 /* m[2], m[6] */
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf // PUNPCKLQDQ XMM9, XMM15 /* m[0], m[8] */
|
||||
MOVOU 80(DX), X12 // X12 = m[10]+m[11]
|
||||
MOVOU 96(DX), X10 // X10 = m[12]+m[13]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4 // PUNPCKLQDQ XMM10, XMM12 /* m[12], m[10] */
|
||||
MOVOU X12, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdd // PUNPCKHQDQ XMM11, XMM13 /* m[11], m[3] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 48(DX), X14 // X14 = m[6]+ m[7]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce // PUNPCKHQDQ XMM9, XMM14 /* ___, m[7] */
|
||||
MOVOU X13, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1 // PUNPCKLQDQ XMM8, XMM9 /* m[4], ____ */
|
||||
MOVOU X15, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc // PUNPCKHQDQ XMM9, XMM12 /* m[15], m[1] */
|
||||
MOVOU 64(DX), X12 // X12 = m[8]+ m[9]
|
||||
MOVOU 96(DX), X10 // X10 = m[12]+m[13]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5 // PUNPCKHQDQ XMM10, XMM13 /* m[13], m[5] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xe4 // PUNPCKHQDQ XMM12, XMM12 /* ___, m[9] */
|
||||
MOVOU X15, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdc // PUNPCKLQDQ XMM11, XMM12 /* m[14], ____ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 7
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc // PUNPCKHQDQ XMM9, XMM12 /* ___, m[1] */
|
||||
MOVOU X14, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1 // PUNPCKLQDQ XMM8, XMM9 /* m[12], ____ */
|
||||
MOVOU X15, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcd // PUNPCKLQDQ XMM9, XMM13 /* m[14], m[4] */
|
||||
MOVOU 80(DX), X11 // X11 = m[10]+m[11]
|
||||
MOVOU X13, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd7 // PUNPCKHQDQ XMM10, XMM15 /* m[5], m[15] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM11, XMM14, 0x8 /* m[13], m[10] */
|
||||
BYTE $0xde; BYTE $0x08
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X15 // X15 = m[10]+m[11]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[0], m[6] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM9, XMM14, 0x8 /* m[9], m[8] */
|
||||
BYTE $0xce; BYTE $0x08
|
||||
MOVOU 16(DX), X11 // X14 = m[2]+ m[3]
|
||||
MOVOU X13, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd3 // PUNPCKHQDQ XMM10, XMM11 /* m[7], m[3] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xff // PUNPCKHQDQ XMM15, XMM15 /* ___, m[11] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf // PUNPCKLQDQ XMM11, XMM15 /* m[2], ____ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 8
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X14, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc5 // PUNPCKHQDQ XMM8, XMM13 /* m[13], m[7] */
|
||||
MOVOU X12, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd4 // PUNPCKHQDQ XMM10, XMM12 /* ___, m[3] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xca // PUNPCKLQDQ XMM9, XMM10 /* m[12], ____ */
|
||||
MOVOU 0(DX), X11 // X11 = m[0]+ m[1]
|
||||
MOVOU 64(DX), X13 // X13 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU X15, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM10, XMM14, 0x8 /* m[11], m[14] */
|
||||
BYTE $0xd6; BYTE $0x08
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdd // PUNPCKHQDQ XMM11, XMM13 /* m[1], m[9] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X13, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc7 // PUNPCKHQDQ XMM8, XMM15 /* m[5], m[15] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcc // PUNPCKLQDQ XMM9, XMM12 /* m[8], m[2] */
|
||||
MOVOU 0(DX), X10 // X10 = m[0]+ m[1]
|
||||
MOVOU 48(DX), X11 // X11 = m[6]+ m[7]
|
||||
MOVOU 80(DX), X15 // X15 = m[10]+m[11]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd5 // PUNPCKLQDQ XMM10, XMM13 /* m[0], m[4] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf // PUNPCKLQDQ XMM11, XMM15 /* m[6], m[10] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 9
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X13, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc7 // PUNPCKLQDQ XMM8, XMM15 /* m[6], m[14] */
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM9, XMM14, 0x8 /* m[11], m[0] */
|
||||
BYTE $0xce; BYTE $0x08
|
||||
MOVOU 16(DX), X13 // X13 = m[2]+ m[3]
|
||||
MOVOU 64(DX), X11 // X11 = m[8]+ m[9]
|
||||
MOVOU X15, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd3 // PUNPCKHQDQ XMM10, XMM11 /* m[15], m[9] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM11, XMM13, 0x8 /* m[3], m[8] */
|
||||
BYTE $0xdd; BYTE $0x08
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 16(DX), X13 // X13 = m[2]+ m[3]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU 96(DX), X15 // X15 = m[12]+m[13]
|
||||
MOVOU X15, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf // PUNPCKHQDQ XMM9, XMM15 /* ___, m[13] */
|
||||
MOVOU X15, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc1 // PUNPCKLQDQ XMM8, XMM9 /* m[12], ____ */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM9, XMM12, 0x8 /* m[1], m[10] */
|
||||
BYTE $0xcc; BYTE $0x08
|
||||
MOVOU 32(DX), X12 // X12 = m[4]+ m[5]
|
||||
MOVOU 48(DX), X15 // X15 = m[6]+ m[7]
|
||||
MOVOU X15, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf // PUNPCKHQDQ XMM11, XMM15 /* ___, m[7] */
|
||||
MOVOU X13, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd3 // PUNPCKLQDQ XMM10, XMM11 /* m[2], ____ */
|
||||
MOVOU X12, X15
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xfc // PUNPCKHQDQ XMM15, XMM12 /* ___, m[5] */
|
||||
MOVOU X12, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf // PUNPCKLQDQ XMM11, XMM15 /* m[4], ____ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 1 0
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 48(DX), X13 // X13 = m[6]+ m[7]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X15 // X15 = m[10]+m[11]
|
||||
MOVOU X15, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc6 // PUNPCKLQDQ XMM8, XMM14 /* m[10], m[8] */
|
||||
MOVOU X13, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcc // PUNPCKHQDQ XMM9, XMM12 /* m[7], m[1] */
|
||||
MOVOU 16(DX), X10 // X10 = m[2]+ m[3]
|
||||
MOVOU 32(DX), X14 // X14 = m[4]+ m[5]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6 // PUNPCKLQDQ XMM10, XMM14 /* m[2], m[4] */
|
||||
MOVOU X14, X15
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xfe // PUNPCKHQDQ XMM15, XMM14 /* ___, m[5] */
|
||||
MOVOU X13, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdf // PUNPCKLQDQ XMM11, XMM15 /* m[6], ____ */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 64(DX), X13 // X13 = m[8]+ m[9]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X15, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xc5 // PUNPCKHQDQ XMM8, XMM13 /* m[15], m[9] */
|
||||
MOVOU X12, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xce // PUNPCKHQDQ XMM9, XMM14 /* m[3], m[13] */
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 80(DX), X13 // X13 = m[10]+m[11]
|
||||
MOVOU X15, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM10, XMM13, 0x8 /* m[11], m[14] */
|
||||
BYTE $0xd5; BYTE $0x08
|
||||
MOVOU X14, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xdc // PUNPCKLQDQ XMM11, XMM12 /* m[12], m[0] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 1 1
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+m[1]
|
||||
MOVOU 16(DX), X13 // X13 = m[2]+m[3]
|
||||
MOVOU 32(DX), X14 // X14 = m[4]+m[5]
|
||||
MOVOU 48(DX), X15 // X15 = m[6]+m[7]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[0], m[2] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf // PUNPCKLQDQ XMM9, XMM15 /* m[4], m[6] */
|
||||
MOVOU X12, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5 // PUNPCKHQDQ XMM10, XMM13 /* m[1], m[3] */
|
||||
MOVOU X14, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf // PUNPCKHQDQ XMM11, XMM15 /* m[5], m[7] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 64(DX), X12 // X12 = m[8]+ m[9]
|
||||
MOVOU 80(DX), X13 // X13 = m[10]+m[11]
|
||||
MOVOU 96(DX), X14 // X14 = m[12]+m[13]
|
||||
MOVOU 112(DX), X15 // X15 = m[14]+m[15]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[8],m[10] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xcf // PUNPCKLQDQ XMM9, XMM15 /* m[12],m[14] */
|
||||
MOVOU X12, X10
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xd5 // PUNPCKHQDQ XMM10, XMM13 /* m[9],m[11] */
|
||||
MOVOU X14, X11
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdf // PUNPCKHQDQ XMM11, XMM15 /* m[13],m[15] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
// R O U N D 1 2
|
||||
///////////////////////////////////////////////////////////////////////////
|
||||
|
||||
// LOAD_MSG_ ##r ##_1(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_2(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 112(DX), X12 // X12 = m[14]+m[15]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 64(DX), X14 // X14 = m[8]+ m[9]
|
||||
MOVOU 96(DX), X15 // X15 = m[12]+m[13]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xc5 // PUNPCKLQDQ XMM8, XMM13 /* m[14], m[4] */
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcf // PUNPCKHQDQ XMM9, XMM15 /* m[9], m[13] */
|
||||
MOVOU 80(DX), X10 // X10 = m[10]+m[11]
|
||||
MOVOU 48(DX), X11 // X11 = m[6]+ m[7]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd6 // PUNPCKLQDQ XMM10, XMM14 /* m[10], m[8] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM11, XMM12, 0x8 /* m[15], m[6] */; ; ; ; ;
|
||||
BYTE $0xdc; BYTE $0x08
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
DIAGONALIZE
|
||||
|
||||
// LOAD_MSG_ ##r ##_3(b0, b1);
|
||||
// LOAD_MSG_ ##r ##_4(b0, b1);
|
||||
// (X12 used as additional temp register)
|
||||
MOVOU 0(DX), X12 // X12 = m[0]+ m[1]
|
||||
MOVOU 32(DX), X13 // X13 = m[4]+ m[5]
|
||||
MOVOU 80(DX), X14 // X14 = m[10]+m[11]
|
||||
MOVOU X12, X8
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x3a; BYTE $0x0f // PALIGNR XMM8, XMM12, 0x8 /* m[1], m[0] */
|
||||
BYTE $0xc4; BYTE $0x08
|
||||
MOVOU X14, X9
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xcd // PUNPCKHQDQ XMM9, XMM13 /* m[11], m[5] */
|
||||
MOVOU 16(DX), X12 // X12 = m[2]+ m[3]
|
||||
MOVOU 48(DX), X11 // X11 = m[6]+ m[7]
|
||||
MOVOU 96(DX), X10 // X10 = m[12]+m[13]
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6c; BYTE $0xd4 // PUNPCKLQDQ XMM10, XMM12 /* m[12], m[2] */
|
||||
BYTE $0x66; BYTE $0x45; BYTE $0x0f; BYTE $0x6d; BYTE $0xdc // PUNPCKHQDQ XMM11, XMM12 /* m[7], m[3] */
|
||||
|
||||
LOAD_SHUFFLE
|
||||
|
||||
G1
|
||||
G2
|
||||
|
||||
UNDIAGONALIZE
|
||||
|
||||
// Reload digest (most current value store in &out)
|
||||
MOVQ out+144(FP), SI // SI: &in
|
||||
MOVOU 0(SI), X12 // X12 = in[0]+in[1] /* row1l = LOAD( &S->h[0] ); */
|
||||
MOVOU 16(SI), X13 // X13 = in[2]+in[3] /* row1h = LOAD( &S->h[2] ); */
|
||||
MOVOU 32(SI), X14 // X14 = in[4]+in[5] /* row2l = LOAD( &S->h[4] ); */
|
||||
MOVOU 48(SI), X15 // X15 = in[6]+in[7] /* row2h = LOAD( &S->h[6] ); */
|
||||
|
||||
// Final computations and prepare for storing
|
||||
PXOR X4, X0 // X0 = X0 ^ X4 /* row1l = _mm_xor_si128( row3l, row1l ); */
|
||||
PXOR X5, X1 // X1 = X1 ^ X5 /* row1h = _mm_xor_si128( row3h, row1h ); */
|
||||
PXOR X12, X0 // X0 = X0 ^ X12 /* STORE( &S->h[0], _mm_xor_si128( LOAD( &S->h[0] ), row1l ) ); */
|
||||
PXOR X13, X1 // X1 = X1 ^ X13 /* STORE( &S->h[2], _mm_xor_si128( LOAD( &S->h[2] ), row1h ) ); */
|
||||
PXOR X6, X2 // X2 = X2 ^ X6 /* row2l = _mm_xor_si128( row4l, row2l ); */
|
||||
PXOR X7, X3 // X3 = X3 ^ X7 /* row2h = _mm_xor_si128( row4h, row2h ); */
|
||||
PXOR X14, X2 // X2 = X2 ^ X14 /* STORE( &S->h[4], _mm_xor_si128( LOAD( &S->h[4] ), row2l ) ); */
|
||||
PXOR X15, X3 // X3 = X3 ^ X15 /* STORE( &S->h[6], _mm_xor_si128( LOAD( &S->h[6] ), row2h ) ); */
|
||||
|
||||
// Store digest into &out
|
||||
MOVQ out+144(FP), SI // SI: &out
|
||||
MOVOU X0, 0(SI) // out[0]+out[1] = X0
|
||||
MOVOU X1, 16(SI) // out[2]+out[3] = X1
|
||||
MOVOU X2, 32(SI) // out[4]+out[5] = X2
|
||||
MOVOU X3, 48(SI) // out[6]+out[7] = X3
|
||||
|
||||
// Increment message pointer and check if there's more to do
|
||||
ADDQ $128, DX // message += 128
|
||||
SUBQ $1, R8
|
||||
JNZ loop
|
||||
|
||||
complete:
|
||||
RET
|
||||
@@ -22,6 +22,8 @@ func compress(d *digest, p []uint8) {
|
||||
compressAVX2(d, p)
|
||||
} else if avx {
|
||||
compressAVX(d, p)
|
||||
} else if ssse3 {
|
||||
compressSSE(d, p)
|
||||
} else {
|
||||
compressGeneric(d, p)
|
||||
}
|
||||
|
||||
13
cpuid.go
13
cpuid.go
@@ -24,8 +24,9 @@ func xgetbv(index uint32) (eax, edx uint32)
|
||||
// True when SIMD instructions are available.
|
||||
var avx2 = haveAVX2()
|
||||
var avx = haveAVX()
|
||||
var ssse3 = haveSSSE3()
|
||||
|
||||
// haveAVX returns true if when there is AVX support
|
||||
// haveAVX returns true when there is AVX support
|
||||
func haveAVX() bool {
|
||||
_, _, c, _ := cpuid(1)
|
||||
|
||||
@@ -38,7 +39,7 @@ func haveAVX() bool {
|
||||
return false
|
||||
}
|
||||
|
||||
// haveAVX2 returns true if when there is AVX2 support
|
||||
// haveAVX2 returns true when there is AVX2 support
|
||||
func haveAVX2() bool {
|
||||
mfi, _, _, _ := cpuid(0)
|
||||
|
||||
@@ -49,3 +50,11 @@ func haveAVX2() bool {
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// haveSSSE3 returns true when there is SSSE3 support
|
||||
func haveSSSE3() bool {
|
||||
|
||||
_, _, c, _ := cpuid(1)
|
||||
|
||||
return (c & 0x00000200) != 0
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user