The Galois hash algorithm (GHASH) is a fairly straightforward keyed hash algorithm based on finite field multiplication, using the field GF(2¹²⁸) with characteristic polynomial x¹²⁸ + x⁷ + x² + x + 1. (An excellent treatment of Galois fields can be found here.)
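For reference, GHASH processes its input as a sequence of 16-byte blocks X₁ … Xₙ and simply chains field multiplications by the hash key H:

    Y₀ = 0,    Yᵢ = (Yᵢ₋₁ ⊕ Xᵢ) · H

with Yₙ as the resulting digest, so the only non-trivial operation involved is the GF(2¹²⁸) multiplication itself.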
The significance of GHASH is that it is used as the authentication component in the GCM algorithm, which is an implementation of authenticated encryption with associated data (AEAD), a cryptographic mode that combines authentication of associated data sent in the clear with encryption and authentication of the payload. It is widely used these days, primarily in the networking domain (IPsec, IEEE 802.11).
ISA support
Both the Intel and ARMv8 instruction sets now contain support for carry-less multiplication (also known as polynomial multiplication), primarily to allow for accelerated implementations of GHASH to be created, which formerly had to rely on unwieldy and less secure table-based implementations. (The Linux implementation pre-computes a 4 KB lookup table for each instance of the hash algorithm that is in use, i.e., for each session having a different key. 4 KB per IPsec connection does not sound too bad in terms of memory usage, but the D-cache footprint may become a bottleneck when serving lots of concurrent connections.) In contrast, implementations based on these special instructions are time invariant, and are significantly faster (around 16x on high-end ARMv8 cores).
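To illustrate what such an instruction computes, here is a bit-serial C sketch of a 64×64 to 128-bit carry-less multiplication (the function name clmul64 is purely illustrative, not any library's API): the operands are treated as polynomials over GF(2), so partial products are combined with XOR and no carries propagate between bit positions.

    #include <stdint.h>

    /* Bit-serial sketch of a 64x64 -> 128-bit carry-less multiplication,
     * i.e., the operation a single PMULL (or PCLMULQDQ on x86) instruction
     * performs.  Partial products are combined with XOR, never with
     * carry-propagating addition. */
    static void clmul64(uint64_t a, uint64_t b, uint64_t res[2])
    {
            uint64_t lo = 0, hi = 0;

            for (int i = 0; i < 64; i++) {
                    if ((b >> i) & 1) {
                            lo ^= a << i;                   /* low half of a * x^i  */
                            hi ^= i ? a >> (64 - i) : 0;    /* high half of a * x^i */
                    }
            }
            res[0] = lo;    /* bits 0..63 of the product   */
            res[1] = hi;    /* bits 64..127 of the product */
    }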
Unfortunately, though, while ARMv8 specifies a range of polynomial multiplication instructions with various operand sizes, the one we are most interested in, which performs carry-less multiplication on two 64-bit operands to produce a 128-bit result, is optional in the architecture. So on low-end cores such as the Cortex-A53 (as can be found in the Raspberry Pi 3), the accelerated driver is not available because this particular instruction is not implemented.
Using vmull.p8 to implement vmull.p64
The other day, I stumbled upon the paper Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine by Danilo Camara, Conrado Gouvea, Julio Lopez and Ricardo Dahab, which describes how 64×64 to 128 bit polynomial multiplication (vmull.p64) can be composed using 8×8 to 16 bit polynomial multiplication (vmull.p8) combined with other SIMD arithmetic instructions. The nice thing about vmull.p8 is that it is a standard NEON instruction, which means all NEON capable CPUs implement it, including the Cortex-A53 on the Raspberry Pi 3.
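The basic trick is that carry-less multiplication distributes over XOR, so a wide multiply can be assembled from narrower ones using only shifts and XORs. Here is a hedged C sketch of the idea at a smaller scale (clmul8 and clmul16 are illustrative names; the paper's actual construction works at byte granularity and uses rotations and masks to keep all partial products inside vector registers, but the underlying algebra is the same schoolbook expansion):

    #include <stdint.h>

    /* 8x8 -> 16-bit carry-less multiply: what a single lane of vmull.p8
     * computes. */
    static uint16_t clmul8(uint8_t a, uint8_t b)
    {
            uint16_t r = 0;

            for (int i = 0; i < 8; i++)
                    if ((b >> i) & 1)
                            r ^= (uint16_t)a << i;
            return r;
    }

    /* 16x16 -> 32-bit carry-less multiply composed from 8x8 ones: since
     * addition in GF(2)[x] is XOR, the schoolbook expansion of
     *   (a1*x^8 + a0) * (b1*x^8 + b0)
     * only needs shifts and XORs to combine the four partial products. */
    static uint32_t clmul16(uint16_t a, uint16_t b)
    {
            uint8_t a0 = a, a1 = a >> 8;
            uint8_t b0 = b, b1 = b >> 8;
            uint32_t lo  = clmul8(a0, b0);
            uint32_t hi  = clmul8(a1, b1);
            uint32_t mid = clmul8(a0, b1) ^ clmul8(a1, b0);

            return (hi << 16) ^ (mid << 8) ^ lo;
    }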
Transliterating 32-bit ARM code to the 64-bit ISA
The algorithm as described in the paper is based on the 32-bit instruction set (retroactively named AArch32), which deviates significantly from the new 64-bit ISA called AArch64. The primary difference is that the number of SIMD registers has increased to 32, which is nice, but which has a downside as well: it is no longer possible to directly use the top half of a 128-bit register as a 64-bit register, which is something the polynomial multiplication algorithm relies on.
The original code looks something like this (note the use of ‘high’ and ‘low’ registers in the same instruction):
    .macro      vmull_p64, rq, ad, bd
    vext.8      t0l, \ad, \ad, #1       @ A1
    vmull.p8    t0q, t0l, \bd           @ F = A1*B
    vext.8      \rq\()_L, \bd, \bd, #1  @ B1
    vmull.p8    \rq, \ad, \rq\()_L      @ E = A*B1
    vext.8      t1l, \ad, \ad, #2       @ A2
    vmull.p8    t1q, t1l, \bd           @ H = A2*B
    vext.8      t3l, \bd, \bd, #2       @ B2
    vmull.p8    t3q, \ad, t3l           @ G = A*B2
    vext.8      t2l, \ad, \ad, #3       @ A3
    vmull.p8    t2q, t2l, \bd           @ J = A3*B
    veor        t0q, t0q, \rq           @ L = E + F
    vext.8      \rq\()_L, \bd, \bd, #3  @ B3
    vmull.p8    \rq, \ad, \rq\()_L      @ I = A*B3
    veor        t1q, t1q, t3q           @ M = G + H
    vext.8      t3l, \bd, \bd, #4       @ B4
    vmull.p8    t3q, \ad, t3l           @ K = A*B4
    veor        t0l, t0l, t0h           @ t0 = (L) (P0 + P1) << 8
    vand        t0h, t0h, k48
    veor        t1l, t1l, t1h           @ t1 = (M) (P2 + P3) << 16
    vand        t1h, t1h, k32
    veor        t2q, t2q, \rq           @ N = I + J
    veor        t0l, t0l, t0h
    veor        t1l, t1l, t1h
    veor        t2l, t2l, t2h           @ t2 = (N) (P4 + P5) << 24
    vand        t2h, t2h, k16
    veor        t3l, t3l, t3h           @ t3 = (K) (P6 + P7) << 32
    vmov.i64    t3h, #0
    vext.8      t0q, t0q, t0q, #15
    veor        t2l, t2l, t2h
    vext.8      t1q, t1q, t1q, #14
    vmull.p8    \rq, \ad, \bd           @ D = A*B
    vext.8      t2q, t2q, t2q, #13
    vext.8      t3q, t3q, t3q, #12
    veor        t0q, t0q, t1q
    veor        t2q, t2q, t3q
    veor        \rq, \rq, t0q
    veor        \rq, \rq, t2q
    .endm
However, things like veor t1l, t1l, t1h or using ext with the upper halves of registers are not possible in AArch64, and so we need to transpose the contents of some of the registers using the tbl and/or zip/unzip instructions. Also, the vmull.p8 instruction now exists in two variants: pmull, operating on the lower halves, and pmull2, operating on the upper halves of the input operands.
We end up with the following sequence, which is 3 instructions longer than the original:
    .macro      __pmull_p8, rq, ad, bd, i
    .ifb        \i
    ext         t4.8b, \ad\().8b, \ad\().8b, #1         // A1
    ext         t8.8b, \bd\().8b, \bd\().8b, #1         // B1
    ext         t5.8b, \ad\().8b, \ad\().8b, #2         // A2
    ext         t7.8b, \bd\().8b, \bd\().8b, #2         // B2
    ext         t6.8b, \ad\().8b, \ad\().8b, #3         // A3
    ext         t9.8b, \bd\().8b, \bd\().8b, #3         // B3
    ext         t3.8b, \bd\().8b, \bd\().8b, #4         // B4

    pmull       t4.8h, t4.8b, \bd\().8b                 // F = A1*B
    pmull       t8.8h, \ad\().8b, t8.8b                 // E = A*B1
    pmull       t5.8h, t5.8b, \bd\().8b                 // H = A2*B
    pmull       t7.8h, \ad\().8b, t7.8b                 // G = A*B2
    pmull       t6.8h, t6.8b, \bd\().8b                 // J = A3*B
    pmull       t9.8h, \ad\().8b, t9.8b                 // I = A*B3
    pmull       t3.8h, \ad\().8b, t3.8b                 // K = A*B4
    pmull       \rq\().8h, \ad\().8b, \bd\().8b         // D = A*B
    .else
    tbl         t4.16b, {\ad\().16b}, perm1.16b         // A1
    tbl         t8.16b, {\bd\().16b}, perm1.16b         // B1
    tbl         t5.16b, {\ad\().16b}, perm2.16b         // A2
    tbl         t7.16b, {\bd\().16b}, perm2.16b         // B2
    tbl         t6.16b, {\ad\().16b}, perm3.16b         // A3
    tbl         t9.16b, {\bd\().16b}, perm3.16b         // B3
    tbl         t3.16b, {\bd\().16b}, perm4.16b         // B4

    pmull2      t4.8h, t4.16b, \bd\().16b               // F = A1*B
    pmull2      t8.8h, \ad\().16b, t8.16b               // E = A*B1
    pmull2      t5.8h, t5.16b, \bd\().16b               // H = A2*B
    pmull2      t7.8h, \ad\().16b, t7.16b               // G = A*B2
    pmull2      t6.8h, t6.16b, \bd\().16b               // J = A3*B
    pmull2      t9.8h, \ad\().16b, t9.16b               // I = A*B3
    pmull2      t3.8h, \ad\().16b, t3.16b               // K = A*B4
    pmull2      \rq\().8h, \ad\().16b, \bd\().16b       // D = A*B
    .endif

    eor         t4.16b, t4.16b, t8.16b                  // L = E + F
    eor         t5.16b, t5.16b, t7.16b                  // M = G + H
    eor         t6.16b, t6.16b, t9.16b                  // N = I + J

    uzp1        t8.2d, t4.2d, t5.2d
    uzp2        t4.2d, t4.2d, t5.2d
    uzp1        t7.2d, t6.2d, t3.2d
    uzp2        t6.2d, t6.2d, t3.2d

    // t4 = (L) (P0 + P1) << 8
    // t5 = (M) (P2 + P3) << 16
    eor         t8.16b, t8.16b, t4.16b
    and         t4.16b, t4.16b, k32_48.16b

    // t6 = (N) (P4 + P5) << 24
    // t7 = (K) (P6 + P7) << 32
    eor         t7.16b, t7.16b, t6.16b
    and         t6.16b, t6.16b, k00_16.16b

    eor         t8.16b, t8.16b, t4.16b
    eor         t7.16b, t7.16b, t6.16b

    zip2        t5.2d, t8.2d, t4.2d
    zip1        t4.2d, t8.2d, t4.2d
    zip2        t3.2d, t7.2d, t6.2d
    zip1        t6.2d, t7.2d, t6.2d

    ext         t4.16b, t4.16b, t4.16b, #15
    ext         t5.16b, t5.16b, t5.16b, #14
    ext         t6.16b, t6.16b, t6.16b, #13
    ext         t3.16b, t3.16b, t3.16b, #12

    eor         t4.16b, t4.16b, t5.16b
    eor         t6.16b, t6.16b, t3.16b
    eor         \rq\().16b, \rq\().16b, t4.16b
    eor         \rq\().16b, \rq\().16b, t6.16b
    .endm
On the Raspberry Pi 3, this code runs 2.8x faster than the generic, table-based C code. This is a nice improvement, but we can do even better.
GHASH reduction
The accelerated GHASH implementation that uses the vmull.p64 instruction looks like this:
    ext         T2.16b, XL.16b, XL.16b, #8
    ext         IN1.16b, T1.16b, T1.16b, #8
    eor         T1.16b, T1.16b, T2.16b
    eor         XL.16b, XL.16b, IN1.16b

    pmull2      XH.1q, XL.2d, SHASH.2d          // a1 * b1
    eor         T1.16b, T1.16b, XL.16b
    pmull       XL.1q, XL.1d, SHASH.1d          // a0 * b0
    pmull       XM.1q, T1.1d, SHASH2.1d         // (a1 + a0)(b1 + b0)

    eor         T2.16b, XL.16b, XH.16b
    ext         T1.16b, XL.16b, XH.16b, #8
    eor         XM.16b, XM.16b, T2.16b

    pmull       T2.1q, XL.1d, MASK.1d

    eor         XM.16b, XM.16b, T1.16b

    mov         XH.d[0], XM.d[1]
    mov         XM.d[1], XL.d[0]

    eor         XL.16b, XM.16b, T2.16b

    ext         T2.16b, XL.16b, XL.16b, #8

    pmull       XL.1q, XL.1d, MASK.1d
    eor         T2.16b, T2.16b, XH.16b
    eor         XL.16b, XL.16b, T2.16b
What should be noted here is that the finite field multiplication consists of a multiplication step and a reduction step, where the latter essentially performs the reduction modulo the characteristic polynomial (which is how we normalize the result, i.e., ensure that it remains inside the field).
So while this sequence is optimal for cores that implement vmull.p64 natively, we can switch to a reduction step that does not involve polynomial multiplication at all, and avoid two copies of the fallback vmull.p64 sequence consisting of 40 instructions each.
    ext         T2.16b, XL.16b, XL.16b, #8
    ext         IN1.16b, T1.16b, T1.16b, #8
    eor         T1.16b, T1.16b, T2.16b
    eor         XL.16b, XL.16b, IN1.16b

    __pmull_p8  XH, XL, SHASH, 2                // a1 * b1
    eor         T1.16b, T1.16b, XL.16b
    __pmull_p8  XL, XL, SHASH                   // a0 * b0
    __pmull_p8  XM, T1, SHASH2                  // (a1 + a0)(b1 + b0)

    eor         T2.16b, XL.16b, XH.16b
    ext         T1.16b, XL.16b, XH.16b, #8
    eor         XM.16b, XM.16b, T2.16b
    eor         XM.16b, XM.16b, T1.16b

    mov         XL.d[1], XM.d[0]
    mov         XH.d[0], XM.d[1]

    shl         T1.2d, XL.2d, #57
    shl         T2.2d, XL.2d, #62
    eor         T2.16b, T2.16b, T1.16b
    shl         T1.2d, XL.2d, #63
    eor         T2.16b, T2.16b, T1.16b
    ext         T1.16b, XL.16b, XH.16b, #8
    eor         T2.16b, T2.16b, T1.16b

    mov         XL.d[1], T2.d[0]
    mov         XH.d[0], T2.d[1]

    ushr        T2.2d, XL.2d, #1
    eor         XH.16b, XH.16b, XL.16b
    eor         XL.16b, XL.16b, T2.16b
    ushr        T2.2d, T2.2d, #6
    ushr        XL.2d, XL.2d, #1

    eor         T2.16b, T2.16b, XH.16b
    eor         XL.16b, XL.16b, T2.16b
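To see why no polynomial multiplication is needed for the reduction, note that x¹²⁸ ≡ x⁷ + x² + x + 1 modulo the characteristic polynomial, so the upper half of the 256-bit product can be folded back down with a handful of shifted XORs. Here is a hedged C sketch of that step (gf128_reduce is an illustrative name, and it uses the plain polynomial bit order for clarity; GCM actually operates on a bit-reflected representation, which is why the shift amounts in the assembly above look different):

    #include <stdint.h>

    /* Reduce a 256-bit carry-less product c[3]:c[2]:c[1]:c[0] (c[3] holds
     * the most significant 64 bits) modulo x^128 + x^7 + x^2 + x + 1,
     * using only shifts and XORs. */
    static void gf128_reduce(const uint64_t c[4], uint64_t r[2])
    {
            uint64_t hi = c[3], lo = c[2];

            /* multiply the upper 128 bits by x^7 + x^2 + x + 1, the
             * remainder of x^128 modulo the characteristic polynomial */
            uint64_t f_hi = hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
                               ^ (lo >> 63) ^ (lo >> 62) ^ (lo >> 57);
            uint64_t f_lo = lo ^ (lo << 1) ^ (lo << 2) ^ (lo << 7);

            /* bits that overflowed past bit 127 (degree <= 134) ... */
            uint64_t ovf = (hi >> 63) ^ (hi >> 62) ^ (hi >> 57);

            /* ... get folded once more; this cannot overflow again */
            f_lo ^= ovf ^ (ovf << 1) ^ (ovf << 2) ^ (ovf << 7);

            r[1] = c[1] ^ f_hi;     /* bits 64..127 of the reduced result */
            r[0] = c[0] ^ f_lo;     /* bits 0..63 of the reduced result   */
    }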
Loop invariants
Another observation one can make when looking at this code is that the vmull.p64 calls that remain all involve right hand sides that are invariants during the execution of the loop. For the version that uses the native vmull.p64, this does not matter much, but for our fallback sequence, it means that some instructions essentially calculate the same value each time, and the computation can be taken out of the loop instead.
Since we have plenty of spare registers on AArch64, we can dedicate 8 of them to prerotated B1/B2/B3/B4 values of SHASH and SHASH2. With that optimization folded in as well, this implementation runs at 4x the speed of the generic GHASH driver. When combined with the bit-sliced AES driver, GCM performance on the Cortex-A53 doubles, dropping from 58 to 29 cycles per byte.
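As a rough illustration of that last point (the names and layout below are hypothetical, not the kernel's actual data structures): since the hash key never changes during a session, the byte-rotated copies of its halves that the fallback multiplication needs (the B1..B4 values, for both SHASH and SHASH2) can be produced once at key-setup time instead of once per 16-byte block.

    #include <stdint.h>

    /* Hypothetical sketch of the loop-invariant hoisting: precompute the
     * byte-rotated copies of one 64-bit half of the hash key at setkey
     * time, so the per-block loop no longer has to recompute them. */
    struct ghash_p8_key {
            uint64_t b;             /* one 64-bit half of the hash key     */
            uint64_t b_rot[4];      /* the same half rotated by 1..4 bytes */
    };

    /* rotate an 8-byte little-endian vector by n bytes (1 <= n <= 7),
     * roughly what ext with two identical source operands does above */
    static inline uint64_t rot_bytes(uint64_t v, unsigned int n)
    {
            return (v >> (8 * n)) | (v << (64 - 8 * n));
    }

    static void ghash_p8_setkey(struct ghash_p8_key *k, uint64_t b)
    {
            k->b = b;
            for (unsigned int i = 0; i < 4; i++)
                    k->b_rot[i] = rot_bytes(b, i + 1);
    }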
The patches implementing this for AArch64 and for AArch32 can be found here.