The Galois hash algorithm (GHASH) is a fairly straightforward keyed hash algorithm based on finite field multiplication, using the field GF(2¹²⁸) with characteristic polynomial x¹²⁸ + x⁷ + x² + x + 1. (An excellent treatment of Galois fields can be found here.)
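For reference, GHASH processes its input as a sequence of 16-byte blocks X₁ … Xₙ and simply chains field multiplications by the hash key H:

    Y₀ = 0,    Yᵢ = (Yᵢ₋₁ ⊕ Xᵢ) · H

with Yₙ as the resulting digest, so the only non-trivial operation involved is the GF(2¹²⁸) multiplication itself.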
The significance of GHASH is that it is used as the authentication component in the GCM algorithm, which is an implementation of authenticated encryption with associated data (AEAD), a cryptographic mode that combines authentication of associated data sent in the clear with encryption and authentication of the payload. It is widely used these days, primarily in the networking domain (IPsec, IEEE 802.11).
ISA support
Both the Intel and ARMv8 instruction sets now contain support for carry-less multiplication (also known as polynomial multiplication), primarily to allow for accelerated implementations of GHASH to be created, which formerly had to rely on unwieldy and less secure table-based implementations. (The Linux implementation pre-computes a 4 KB lookup table for each instance of the hash algorithm that is in use, i.e., for each session having a different key. 4 KB per IPsec connection does not sound too bad in terms of memory usage, but the D-cache footprint may become a bottleneck when serving lots of concurrent connections.) In contrast, implementations based on these special instructions are time invariant, and are significantly faster (around 16x on high-end ARMv8 cores).
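To illustrate what such an instruction computes, here is a bit-serial C sketch of a 64×64 to 128-bit carry-less multiplication (the function name clmul64 is purely illustrative, not any library's API): the operands are treated as polynomials over GF(2), so partial products are combined with XOR and no carries propagate between bit positions.

    #include <stdint.h>

    /* Bit-serial sketch of a 64x64 -> 128-bit carry-less multiplication,
     * i.e., the operation a single PMULL (or PCLMULQDQ on x86) instruction
     * performs.  Partial products are combined with XOR, never with
     * carry-propagating addition. */
    static void clmul64(uint64_t a, uint64_t b, uint64_t res[2])
    {
            uint64_t lo = 0, hi = 0;

            for (int i = 0; i < 64; i++) {
                    if ((b >> i) & 1) {
                            lo ^= a << i;                   /* low half of a * x^i  */
                            hi ^= i ? a >> (64 - i) : 0;    /* high half of a * x^i */
                    }
            }
            res[0] = lo;    /* bits 0..63 of the product   */
            res[1] = hi;    /* bits 64..127 of the product */
    }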
Unfortunately, though, while ARMv8 specifies a range of polynomial multiplication instructions with various operand sizes, the one we are most interested in, which performs carry-less multiplication on two 64-bit operands to produce a 128-bit result, is optional in the architecture. So on low-end cores such as the Cortex-A53 (as can be found in the Raspberry Pi 3), the accelerated driver is not available because this particular instruction is not implemented.
Using vmull.p8 to implement vmull.p64
The other day, I stumbled upon the paper Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine by Danilo Camara, Conrado Gouvea, Julio Lopez and Ricardo Dahab, which describes how 64×64 to 128 bit polynomial multiplication (vmull.p64) can be composed using 8×8 to 16 bit polynomial multiplication (vmull.p8) combined with other SIMD arithmetic instructions. The nice thing about vmull.p8 is that it is a standard NEON instruction, which means all NEON capable CPUs implement it, including the Cortex-A53 on the Raspberry Pi 3.
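The basic trick is that carry-less multiplication distributes over XOR, so a wide multiply can be assembled from narrower ones using only shifts and XORs. Here is a hedged C sketch of the idea at a smaller scale (clmul8 and clmul16 are illustrative names; the paper's actual construction works at byte granularity and uses rotations and masks to keep all partial products inside vector registers, but the underlying algebra is the same schoolbook expansion):

    #include <stdint.h>

    /* 8x8 -> 16-bit carry-less multiply: what a single lane of vmull.p8
     * computes. */
    static uint16_t clmul8(uint8_t a, uint8_t b)
    {
            uint16_t r = 0;

            for (int i = 0; i < 8; i++)
                    if ((b >> i) & 1)
                            r ^= (uint16_t)a << i;
            return r;
    }

    /* 16x16 -> 32-bit carry-less multiply composed from 8x8 ones: since
     * addition in GF(2)[x] is XOR, the schoolbook expansion of
     *   (a1*x^8 + a0) * (b1*x^8 + b0)
     * only needs shifts and XORs to combine the four partial products. */
    static uint32_t clmul16(uint16_t a, uint16_t b)
    {
            uint8_t a0 = a, a1 = a >> 8;
            uint8_t b0 = b, b1 = b >> 8;
            uint32_t lo  = clmul8(a0, b0);
            uint32_t hi  = clmul8(a1, b1);
            uint32_t mid = clmul8(a0, b1) ^ clmul8(a1, b0);

            return (hi << 16) ^ (mid << 8) ^ lo;
    }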
Transliterating 32-bit ARM code to the 64-bit ISA
The algorithm as described in the paper is based on the 32-bit instruction set (retroactively named AArch32), which deviates significantly from the new 64-bit ISA called AArch64. The primary difference is that the number of SIMD registers has increased to 32, which is nice, but which has a downside as well: it is no longer possible to directly use the top half of a 128-bit register as a 64-bit register, which is something the polynomial multiplication algorithm relies on.
The original code looks something like this (note the use of ‘high’ and ‘low’ registers in the same instruction):
    .macro      vmull_p64, rq, ad, bd
    vext.8      t0l, \ad, \ad, #1       @ A1
    vmull.p8    t0q, t0l, \bd           @ F = A1*B
    vext.8      \rq\()_L, \bd, \bd, #1  @ B1
    vmull.p8    \rq, \ad, \rq\()_L      @ E = A*B1
    vext.8      t1l, \ad, \ad, #2       @ A2
    vmull.p8    t1q, t1l, \bd           @ H = A2*B
    vext.8      t3l, \bd, \bd, #2       @ B2
    vmull.p8    t3q, \ad, t3l           @ G = A*B2
    vext.8      t2l, \ad, \ad, #3       @ A3
    vmull.p8    t2q, t2l, \bd           @ J = A3*B
    veor        t0q, t0q, \rq           @ L = E + F
    vext.8      \rq\()_L, \bd, \bd, #3  @ B3
    vmull.p8    \rq, \ad, \rq\()_L      @ I = A*B3
    veor        t1q, t1q, t3q           @ M = G + H
    vext.8      t3l, \bd, \bd, #4       @ B4
    vmull.p8    t3q, \ad, t3l           @ K = A*B4
    veor        t0l, t0l, t0h           @ t0 = (L) (P0 + P1) << 8
    vand        t0h, t0h, k48
    veor        t1l, t1l, t1h           @ t1 = (M) (P2 + P3) << 16
    vand        t1h, t1h, k32
    veor        t2q, t2q, \rq           @ N = I + J
    veor        t0l, t0l, t0h
    veor        t1l, t1l, t1h
    veor        t2l, t2l, t2h           @ t2 = (N) (P4 + P5) << 24
    vand        t2h, t2h, k16
    veor        t3l, t3l, t3h           @ t3 = (K) (P6 + P7) << 32
    vmov.i64    t3h, #0
    vext.8      t0q, t0q, t0q, #15
    veor        t2l, t2l, t2h
    vext.8      t1q, t1q, t1q, #14
    vmull.p8    \rq, \ad, \bd           @ D = A*B
    vext.8      t2q, t2q, t2q, #13
    vext.8      t3q, t3q, t3q, #12
    veor        t0q, t0q, t1q
    veor        t2q, t2q, t3q
    veor        \rq, \rq, t0q
    veor        \rq, \rq, t2q
    .endm
However, things like veor t1l, t1l, t1h or using ext with the upper halves of registers are not possible in AArch64, and so we need to transpose the contents of some of the registers using the tbl and/or zip/unzip instructions. Also, the vmull.p8 instruction now exists in two variants: pmull, operating on the lower halves, and pmull2, operating on the upper halves of the input operands.
We end up with the following sequence, which is 3 instructions longer than the original:
    .macro      __pmull_p8, rq, ad, bd, i
    .ifb        \i
    ext         t4.8b, \ad\().8b, \ad\().8b, #1         // A1
    ext         t8.8b, \bd\().8b, \bd\().8b, #1         // B1
    ext         t5.8b, \ad\().8b, \ad\().8b, #2         // A2
    ext         t7.8b, \bd\().8b, \bd\().8b, #2         // B2
    ext         t6.8b, \ad\().8b, \ad\().8b, #3         // A3
    ext         t9.8b, \bd\().8b, \bd\().8b, #3         // B3
    ext         t3.8b, \bd\().8b, \bd\().8b, #4         // B4

    pmull       t4.8h, t4.8b, \bd\().8b                 // F = A1*B
    pmull       t8.8h, \ad\().8b, t8.8b                 // E = A*B1
    pmull       t5.8h, t5.8b, \bd\().8b                 // H = A2*B
    pmull       t7.8h, \ad\().8b, t7.8b                 // G = A*B2
    pmull       t6.8h, t6.8b, \bd\().8b                 // J = A3*B
    pmull       t9.8h, \ad\().8b, t9.8b                 // I = A*B3
    pmull       t3.8h, \ad\().8b, t3.8b                 // K = A*B4
    pmull       \rq\().8h, \ad\().8b, \bd\().8b         // D = A*B
    .else
    tbl         t4.16b, {\ad\().16b}, perm1.16b         // A1
    tbl         t8.16b, {\bd\().16b}, perm1.16b         // B1
    tbl         t5.16b, {\ad\().16b}, perm2.16b         // A2
    tbl         t7.16b, {\bd\().16b}, perm2.16b         // B2
    tbl         t6.16b, {\ad\().16b}, perm3.16b         // A3
    tbl         t9.16b, {\bd\().16b}, perm3.16b         // B3
    tbl         t3.16b, {\bd\().16b}, perm4.16b         // B4

    pmull2      t4.8h, t4.16b, \bd\().16b               // F = A1*B
    pmull2      t8.8h, \ad\().16b, t8.16b               // E = A*B1
    pmull2      t5.8h, t5.16b, \bd\().16b               // H = A2*B
    pmull2      t7.8h, \ad\().16b, t7.16b               // G = A*B2
    pmull2      t6.8h, t6.16b, \bd\().16b               // J = A3*B
    pmull2      t9.8h, \ad\().16b, t9.16b               // I = A*B3
    pmull2      t3.8h, \ad\().16b, t3.16b               // K = A*B4
    pmull2      \rq\().8h, \ad\().16b, \bd\().16b       // D = A*B
    .endif

    eor         t4.16b, t4.16b, t8.16b                  // L = E + F
    eor         t5.16b, t5.16b, t7.16b                  // M = G + H
    eor         t6.16b, t6.16b, t9.16b                  // N = I + J

    uzp1        t8.2d, t4.2d, t5.2d
    uzp2        t4.2d, t4.2d, t5.2d
    uzp1        t7.2d, t6.2d, t3.2d
    uzp2        t6.2d, t6.2d, t3.2d

    // t4 = (L) (P0 + P1) << 8
    // t5 = (M) (P2 + P3) << 16
    eor         t8.16b, t8.16b, t4.16b
    and         t4.16b, t4.16b, k32_48.16b

    // t6 = (N) (P4 + P5) << 24
    // t7 = (K) (P6 + P7) << 32
    eor         t7.16b, t7.16b, t6.16b
    and         t6.16b, t6.16b, k00_16.16b

    eor         t8.16b, t8.16b, t4.16b
    eor         t7.16b, t7.16b, t6.16b

    zip2        t5.2d, t8.2d, t4.2d
    zip1        t4.2d, t8.2d, t4.2d
    zip2        t3.2d, t7.2d, t6.2d
    zip1        t6.2d, t7.2d, t6.2d

    ext         t4.16b, t4.16b, t4.16b, #15
    ext         t5.16b, t5.16b, t5.16b, #14
    ext         t6.16b, t6.16b, t6.16b, #13
    ext         t3.16b, t3.16b, t3.16b, #12

    eor         t4.16b, t4.16b, t5.16b
    eor         t6.16b, t6.16b, t3.16b
    eor         \rq\().16b, \rq\().16b, t4.16b
    eor         \rq\().16b, \rq\().16b, t6.16b
    .endm
On the Raspberry Pi 3, this code runs 2.8x faster than the generic, table-based C code. This is a nice improvement, but we can do even better.
GHASH reduction
The accelerated GHASH implementation that uses the vmull.p64 instruction looks like this:
    ext         T2.16b, XL.16b, XL.16b, #8
    ext         IN1.16b, T1.16b, T1.16b, #8
    eor         T1.16b, T1.16b, T2.16b
    eor         XL.16b, XL.16b, IN1.16b

    pmull2      XH.1q, XL.2d, SHASH.2d          // a1 * b1
    eor         T1.16b, T1.16b, XL.16b
    pmull       XL.1q, XL.1d, SHASH.1d          // a0 * b0
    pmull       XM.1q, T1.1d, SHASH2.1d         // (a1 + a0)(b1 + b0)

    eor         T2.16b, XL.16b, XH.16b
    ext         T1.16b, XL.16b, XH.16b, #8
    eor         XM.16b, XM.16b, T2.16b

    pmull       T2.1q, XL.1d, MASK.1d

    eor         XM.16b, XM.16b, T1.16b

    mov         XH.d[0], XM.d[1]
    mov         XM.d[1], XL.d[0]

    eor         XL.16b, XM.16b, T2.16b

    ext         T2.16b, XL.16b, XL.16b, #8

    pmull       XL.1q, XL.1d, MASK.1d
    eor         T2.16b, T2.16b, XH.16b
    eor         XL.16b, XL.16b, T2.16b
What should be noted here is that the finite field multiplication consists of a multiplication step and a reduction step, where the latter essentially performs the reduction modulo the characteristic polynomial (which is how we normalize the result, i.e., ensure that it remains inside the field).
So while this sequence is optimal for cores that implement vmull.p64 natively, we can switch to a reduction step that does not involve polynomial multiplication at all, and avoid two copies of the fallback vmull.p64 sequence consisting of 40 instructions each.
    ext         T2.16b, XL.16b, XL.16b, #8
    ext         IN1.16b, T1.16b, T1.16b, #8
    eor         T1.16b, T1.16b, T2.16b
    eor         XL.16b, XL.16b, IN1.16b

    __pmull_p8  XH, XL, SHASH, 2                // a1 * b1
    eor         T1.16b, T1.16b, XL.16b
    __pmull_p8  XL, XL, SHASH                   // a0 * b0
    __pmull_p8  XM, T1, SHASH2                  // (a1 + a0)(b1 + b0)

    eor         T2.16b, XL.16b, XH.16b
    ext         T1.16b, XL.16b, XH.16b, #8
    eor         XM.16b, XM.16b, T2.16b
    eor         XM.16b, XM.16b, T1.16b

    mov         XL.d[1], XM.d[0]
    mov         XH.d[0], XM.d[1]

    shl         T1.2d, XL.2d, #57
    shl         T2.2d, XL.2d, #62
    eor         T2.16b, T2.16b, T1.16b
    shl         T1.2d, XL.2d, #63
    eor         T2.16b, T2.16b, T1.16b
    ext         T1.16b, XL.16b, XH.16b, #8
    eor         T2.16b, T2.16b, T1.16b

    mov         XL.d[1], T2.d[0]
    mov         XH.d[0], T2.d[1]

    ushr        T2.2d, XL.2d, #1
    eor         XH.16b, XH.16b, XL.16b
    eor         XL.16b, XL.16b, T2.16b
    ushr        T2.2d, T2.2d, #6
    ushr        XL.2d, XL.2d, #1

    eor         T2.16b, T2.16b, XH.16b
    eor         XL.16b, XL.16b, T2.16b
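To see why no polynomial multiplication is needed for the reduction, note that x¹²⁸ ≡ x⁷ + x² + x + 1 modulo the characteristic polynomial, so the upper half of the 256-bit product can be folded back down with a handful of shifted XORs. Here is a hedged C sketch of that step (gf128_reduce is an illustrative name, and it uses the plain polynomial bit order for clarity; GCM actually operates on a bit-reflected representation, which is why the shift amounts in the assembly above look different):

    #include <stdint.h>

    /* Reduce a 256-bit carry-less product c[3]:c[2]:c[1]:c[0] (c[3] holds
     * the most significant 64 bits) modulo x^128 + x^7 + x^2 + x + 1,
     * using only shifts and XORs. */
    static void gf128_reduce(const uint64_t c[4], uint64_t r[2])
    {
            uint64_t hi = c[3], lo = c[2];

            /* multiply the upper 128 bits by x^7 + x^2 + x + 1, the
             * remainder of x^128 modulo the characteristic polynomial */
            uint64_t f_hi = hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
                               ^ (lo >> 63) ^ (lo >> 62) ^ (lo >> 57);
            uint64_t f_lo = lo ^ (lo << 1) ^ (lo << 2) ^ (lo << 7);

            /* bits that overflowed past bit 127 (degree <= 134) ... */
            uint64_t ovf = (hi >> 63) ^ (hi >> 62) ^ (hi >> 57);

            /* ... get folded once more; this cannot overflow again */
            f_lo ^= ovf ^ (ovf << 1) ^ (ovf << 2) ^ (ovf << 7);

            r[1] = c[1] ^ f_hi;     /* bits 64..127 of the reduced result */
            r[0] = c[0] ^ f_lo;     /* bits 0..63 of the reduced result   */
    }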
Loop invariants
Another observation one can make when looking at this code is that the vmull.p64 calls that remain all involve right hand sides that are invariants during the execution of the loop. For the version that uses the native vmull.p64, this does not matter much, but for our fallback sequence, it means that some instructions essentially calculate the same value each time, and the computation can be taken out of the loop instead.
Since we have plenty of spare registers on AArch64, we can dedicate 8 of them to prerotated B1/B2/B3/B4 values of SHASH and SHASH2. With that optimization folded in as well, this implementation runs at 4x the speed of the generic GHASH driver. When combined with the bit-sliced AES driver, GCM performance on the Cortex-A53 doubles, dropping from 58 to 29 cycles per byte.
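As a rough illustration of that last point (the names and layout below are hypothetical, not the kernel's actual data structures): since the hash key never changes during a session, the byte-rotated copies of its halves that the fallback multiplication needs (the B1..B4 values, for both SHASH and SHASH2) can be produced once at key-setup time instead of once per 16-byte block.

    #include <stdint.h>

    /* Hypothetical sketch of the loop-invariant hoisting: precompute the
     * byte-rotated copies of one 64-bit half of the hash key at setkey
     * time, so the per-block loop no longer has to recompute them. */
    struct ghash_p8_key {
            uint64_t b;             /* one 64-bit half of the hash key     */
            uint64_t b_rot[4];      /* the same half rotated by 1..4 bytes */
    };

    /* rotate an 8-byte little-endian vector by n bytes (1 <= n <= 7),
     * roughly what ext with two identical source operands does above */
    static inline uint64_t rot_bytes(uint64_t v, unsigned int n)
    {
            return (v >> (8 * n)) | (v << (64 - 8 * n));
    }

    static void ghash_p8_setkey(struct ghash_p8_key *k, uint64_t b)
    {
            k->b = b;
            for (unsigned int i = 0; i < 4; i++)
                    k->b_rot[i] = rot_bytes(b, i + 1);
    }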
The patches implementing this for AArch64 and for AArch32 can be found here.